A target vehicle detection method and device, electronic equipment and readable storage medium

CN122307553A · Pending · Publication Date: 2026-06-30 · CHINA FAW CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: CHINA FAW CO LTD
Filing Date: 2026-03-31
Publication Date: 2026-06-30


Abstract

This application provides a target vehicle detection method, apparatus, electronic device, and readable storage medium. The method acquires image data and radar data; extracts visual features of the target vehicle from the image data to obtain a visual feature map; and extracts point cloud clustering features and micro-Doppler features of the target vehicle from the radar data. The visual feature map, point cloud clustering features, and micro-Doppler features are fused to obtain a fused feature vector. The fused feature vector from multiple consecutive frames is input into a trained sequence model to output the motion state of the target vehicle. At least one supplementary piece of evidence is acquired, and the probability of the target vehicle's existence is calculated based on this evidence. A virtual target of the target vehicle is generated based on the probability of existence and the motion state. Thus, by fusing local visual cues and radar features, and combining a sequence model with multi-evidence reasoning, active detection and state estimation of occluded target vehicles are achieved, improving the accuracy of target detection for autonomous vehicles.

Description

Technical Field

[0001] This application relates to the field of autonomous driving technology, and in particular to a target vehicle detection method, device, electronic device, and readable storage medium.

Background Technology

[0002] With the rapid development of autonomous driving technology, environmental perception systems, as the cornerstone of autonomous driving decision-making and control, directly impact driving safety and comfort. Accurately detecting surrounding vehicles, especially smaller, highly maneuverable two-wheeled vehicles, is crucial for improving the adaptability of autonomous driving systems in complex traffic scenarios.

[0003] In existing technologies, autonomous vehicles typically employ visual sensors and millimeter-wave radar for target detection. Visual sensors use deep learning models to identify vehicle outlines, while millimeter-wave radar determines target positions by detecting point cloud echoes. However, when a two-wheeled vehicle is traveling close behind a large vehicle, visual sensors struggle to separate the two-wheeled vehicle from the outline of the large vehicle. Radar sensors, because the two-wheeled vehicle has a small radar cross-section and its echo is mixed with the strong echo of the large vehicle, tend to filter it out as noise or misclassify it as part of the large vehicle. This results in low target detection accuracy for autonomous vehicles in severely occluded scenarios.

Summary of the Invention

[0004] In view of this, embodiments of this application provide at least one target vehicle detection method, device, electronic device, and readable storage medium. By fusing visual local cues and radar features, and combining sequence models and multi-evidence reasoning, active detection and state estimation of occluded target vehicles are achieved, thereby improving the accuracy of target detection for autonomous vehicles.

[0005] This application mainly includes the following aspects. In a first aspect, embodiments of this application provide a target vehicle detection method, the method comprising: acquiring image data collected by the vehicle's image sensor and radar data collected by the radar sensor; extracting visual features of the target vehicle from the image data to obtain a visual feature map, and extracting point cloud clustering features and micro-Doppler features of the target vehicle from the radar data, wherein the visual feature map indicates local cue regions in the image that are associated with the target vehicle; fusing the visual feature map, the point cloud clustering features, and the micro-Doppler features to obtain a fused feature vector, and inputting the fused feature vectors of multiple consecutive frames into a trained sequence model to output the motion state of the target vehicle; acquiring at least one piece of supplementary evidence, and calculating the existence probability of the target vehicle based on the at least one piece of supplementary evidence; and generating a virtual target of the target vehicle based on the existence probability and the motion state, wherein the virtual target includes the identifier, position, and speed of the target vehicle.

[0006] In a second aspect, embodiments of this application also provide a target vehicle detection device, the target vehicle detection device comprising: a data acquisition module, used to acquire image data collected by the vehicle's image sensor and radar data collected by the radar sensor; a feature extraction module, used to extract the visual features of the target vehicle from the image data to obtain a visual feature map, and to extract the point cloud clustering features and micro-Doppler features of the target vehicle from the radar data, wherein the visual feature map indicates the local cue regions in the image that are associated with the target vehicle; a state tracking module, used to fuse the visual feature map, the point cloud clustering features, and the micro-Doppler features to obtain a fused feature vector, and to input the fused feature vectors of multiple consecutive frames into the trained sequence model to output the motion state of the target vehicle; a probability calculation module, used to acquire at least one piece of supplementary evidence and calculate the existence probability of the target vehicle based on the at least one piece of supplementary evidence; and a target generation module, used to generate a virtual target of the target vehicle based on the existence probability and the motion state, wherein the virtual target includes the identifier, position, and speed of the target vehicle.

[0007] In a third aspect, embodiments of this application also provide an electronic device, including: a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory through the bus, and the machine-readable instructions are executed by the processor to perform the steps of the target vehicle detection method as described above.

[0008] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, performs the steps of the target vehicle detection method described above.

[0009] This application provides a target vehicle detection method, apparatus, electronic device, and readable storage medium. The method includes: acquiring image data collected by a vehicle image sensor and radar data collected by a radar sensor; extracting visual features of the target vehicle from the image data to obtain a visual feature map, and extracting point cloud clustering features and micro-Doppler features of the target vehicle from the radar data; the visual feature map indicates local cue regions in the image associated with the target vehicle; fusing the visual feature map, point cloud clustering features, and micro-Doppler features to obtain a fused feature vector, and inputting the fused feature vector of multiple consecutive frames into a trained sequence model to output the motion state of the target vehicle; acquiring at least one supplementary evidence and calculating the existence probability of the target vehicle based on the at least one supplementary evidence; generating a virtual target of the target vehicle based on the existence probability and motion state; the virtual target includes the target vehicle's identifier, position, and speed. Thus, by fusing visual local cues and radar features, and combining a sequence model with multi-evidence reasoning, active detection and state estimation of occluded target vehicles are achieved, improving the accuracy of target detection for autonomous vehicles.

[0010] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Brief Description of the Drawings

[0011] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0012] Figure 1 is a flowchart of a target vehicle detection method provided in an embodiment of this application; Figure 2 is a first functional block diagram of a target vehicle detection device provided in an embodiment of this application; Figure 3 is a second functional block diagram of a target vehicle detection device provided in an embodiment of this application; Figure 4 is a schematic structural diagram of an electronic device provided in an embodiment of this application.

Detailed Description

[0013] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0014] To facilitate understanding of this application, the technical solutions provided in this application will be described in detail below with reference to specific embodiments.

[0015] Referring to Figure 1, Figure 1 is a flowchart of a target vehicle detection method provided in an embodiment of this application. As shown in Figure 1, the target vehicle detection method provided in this application includes the following steps: S101, acquire image data collected by the vehicle's image sensor and radar data collected by the radar sensor.

[0016] Here, the image sensor is used to acquire visual image information of the area in front of the vehicle, while the radar sensor is used to acquire radar point cloud and raw signal information of the area in front of the vehicle. By acquiring sensor data from two different modalities, complementary information sources can be provided for subsequent target detection.

[0017] In this embodiment, the image sensor mounted on the vehicle can be a monocular camera, a binocular camera, or a surround-view camera, used to acquire image sequences of the area in front of the vehicle in real time. The radar sensor mounted on the vehicle can be a millimeter-wave radar, used to emit electromagnetic waves and receive echo signals, generating point cloud data containing target distance, speed, and angle information, as well as raw I/Q signals. The two sensors work together to provide a data foundation for subsequent feature extraction.

[0018] S102, extract the visual features of the target vehicle from the image data to obtain a visual feature map, and extract the point cloud clustering features and micro-Doppler features of the target vehicle from the radar data; the visual feature map indicates the local cue regions in the image associated with the target vehicle.

[0019] Here, the purpose of visual feature extraction is to find local clues in the image that can suggest the existence of the target vehicle, rather than the complete target outline; the purpose of radar feature extraction is to identify candidate targets from sparse point clouds and analyze their micro-motion features to determine the existence of the target.

[0020] In this embodiment, visual features are extracted from image data to obtain a visual feature map used to indicate local cue regions. Point cloud clustering features and micro-Doppler features are extracted from radar data, wherein the point cloud clustering features reflect the geometric distribution information of the radar target, and the micro-Doppler features reflect the micro-motion characteristics of the target.

[0021] S103, the visual feature map, the point cloud clustering feature and the micro-Doppler feature are fused to obtain a fused feature vector, and the fused feature vector of multiple consecutive frames is input into the trained sequence model to output the motion state of the target vehicle.

[0022] Here, the purpose of feature fusion is to process visual features and radar features in a coordinated manner, so that the two can verify and enhance each other; the sequence model is used to capture the motion pattern of the target in the time dimension and maintain the continuity of tracking.

[0023] In this embodiment, feature fusion includes collaboration in two directions: firstly, local cue regions in the visual feature map are used as spatial constraints to limit the focus range of radar feature extraction, enabling radar weak target detection to focus on suspicious areas; secondly, point cloud clustering features and micro-Doppler features are used as motion verification to determine whether the local cue regions in the visual feature map belong to real moving targets. The fused feature vector is input into a sequence model, which can be an LSTM network or a Transformer network. By analyzing the feature vector sequence of multiple consecutive frames, it outputs motion state information such as the current position and speed of the target vehicle.

[0024] S104, obtain at least one supplementary piece of evidence, and calculate the probability of the existence of the target vehicle based on the at least one supplementary piece of evidence.

[0025] Here, supplementary evidence refers to auxiliary information obtained from other dimensions to enhance the credibility of the target's existence assessment. By fusing and calculating multiple pieces of evidence, the probability of the target vehicle's existence can be obtained.

[0026] In this embodiment, supplementary evidence is used to verify the existence of the target vehicle from different perspectives, including but not limited to one or more of speed difference evidence, free space evidence, and external communication evidence. The probability of the target vehicle's existence is calculated based on the acquired supplementary evidence; a higher probability indicates a greater likelihood that the target vehicle actually exists.

[0027] S105, Generate a virtual target of the target vehicle based on the existence probability and the motion state; the virtual target includes the identifier, position and speed of the target vehicle.

[0028] Here, when the probability of existence is high enough, the system will proactively construct a virtual target to represent the occluded target vehicle for use by the subsequent planning and control module.

[0029] In this embodiment, when the calculated probability of existence exceeds a preset threshold, the system confirms the existence of the target vehicle and assigns a unique identifier to the target based on its motion state information. Simultaneously, it determines the estimated position and speed, forming a complete virtual target. This virtual target is output to the planning and control module of the autonomous driving system as part of the perception results for subsequent decision-making and control, such as adopting defensive driving strategies, slowing down in advance, or increasing following distance to address potential risks.
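The following is a minimal sketch of steps S104 and S105, not the method of this application itself. The text does not specify how the supplementary evidence is combined, so the sketch assumes that each piece of evidence is expressed as a likelihood ratio and that the ratios are fused in log-odds space; the prior, the threshold of 0.7, the example ratios, the example position, and the names `existence_probability`, `make_virtual_target`, and `VirtualTarget` are all hypothetical.

```python
import math
from dataclasses import dataclass
from itertools import count
from typing import Optional, Sequence, Tuple

_next_id = count(1)  # hypothetical source of unique target identifiers


@dataclass
class VirtualTarget:
    identifier: int
    position_m: Tuple[float, float]  # (longitudinal, lateral) in the ego frame
    speed_mps: float
    existence_probability: float


def existence_probability(likelihood_ratios: Sequence[float], prior: float = 0.1) -> float:
    """Fuse independent evidence in log-odds space (an assumed fusion rule)."""
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


def make_virtual_target(prob: float, position_m: Tuple[float, float],
                        speed_mps: float, threshold: float = 0.7) -> Optional[VirtualTarget]:
    """Create a virtual target only when the probability exceeds the preset threshold."""
    if prob <= threshold:
        return None
    return VirtualTarget(next(_next_id), position_m, speed_mps, prob)


# Illustrative ratios for speed difference, free space, and external communication evidence.
p = existence_probability([4.0, 3.0, 5.0], prior=0.1)
target = make_virtual_target(p, position_m=(21.5, 0.3), speed_mps=77 / 3.6)
print(round(p, 3), target)
```

Working in log-odds keeps the fusion additive, so a further evidence source only contributes one more term; this is a design choice of the sketch rather than something stated in the text.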

[0030] The following is a specific example to illustrate this: Taking an urban expressway scenario as an example, a car is traveling at 80 km/h, a truck is traveling ahead at 75 km/h, and an electric bicycle is following closely behind the truck, completely invisible from the car's perspective. The car collects image data via a camera and radar data via millimeter-wave radar. Visual features are extracted from the image data to obtain a visual feature map indicating the local cue region; point cloud clustering features and micro-Doppler features are extracted from the radar data. The visual feature map, point cloud clustering features, and micro-Doppler features are fused to obtain a fused feature vector. The fused feature vectors from multiple consecutive frames are input into a sequence model to output the motion state of the electric bicycle. Simultaneously, supplementary evidence is acquired, and the existence probability is calculated based on the supplementary evidence. A virtual target is generated based on the existence probability and the motion state. This virtual target includes the electric bicycle's identifier, position, and speed, and is used to represent the occluded electric bicycle.

[0031] Further, the step of extracting the visual features of the target vehicle from the image data to obtain a visual feature map includes: Step a1: Input the image data into a convolutional network, and extract features of the ground area at the bottom of the vehicle in front, the two side edge areas of the rear of the vehicle in front, and the projection and shadow areas of the ground behind the vehicle in front based on the attention module in the convolutional network.

[0032] Here, since the occluded target vehicle is usually hidden behind the vehicle in front, its visible parts often appear in the gap between the bottom of the vehicle and the ground, the sides of the rear, or form unusual shadows in the ground projection. By focusing on these specific areas with an attention module, local cues of the target vehicle can be effectively captured, avoiding indiscriminate feature extraction across the entire image.

[0033] In this embodiment, the convolutional network can use a backbone network such as ResNet or EfficientNet, and the attention module can employ a combination of spatial attention and channel attention mechanisms, such as SE-Net or CBAM. Through training, this attention module automatically allocates computational resources to the three key regions: the contact area between the bottom of the vehicle and the ground, the edge regions on both sides of the rear of the vehicle, and the projection and shadow regions on the ground behind the vehicle, thereby extracting feature representations of these regions.

[0034] To capture dynamic cues, the convolutional network either uses 3D convolution to extract spatiotemporal features from multiple consecutive image frames, or uses 2D convolution with both the image frames and their optical flow maps as input. The optical flow map indicates minute movements at the rear edge of the vehicle ahead, so that even if the moving object itself is not visible, its dynamic features can still be captured through the optical flow map.

[0035] Step a2: Generate a salient region heatmap from the extracted features, the heatmap serving as the visual feature map; wherein the training supervision signal of the salient region heatmap is the local region annotation related to the target vehicle.

[0036] Here, transforming the extracted features into salient region heatmaps makes the feature representation more intuitive and facilitates subsequent fusion with radar features. Simultaneously, by introducing local region annotations as supervisory signals, the network can learn which local regions are highly correlated with the presence of the target vehicle.

[0037] In this embodiment, the features extracted in step a1 are encoded and mapped to generate a salient region heatmap corresponding to the input image size. The pixel values in this heatmap represent the probability of a local clue indicating the presence of a target vehicle at that location. During the training phase, manually annotated local region heatmaps are used as supervision signals. The annotation method involves marking local areas in the image that suggest the presence of a target vehicle, such as wheels, handlebars, helmets, and unusual shadows, to generate ground truth heatmaps. The network is optimized through training by minimizing the difference between the predicted heatmap and the ground truth heatmap.
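As an illustration of the supervision described above, the sketch below renders annotated local cue positions as a ground truth heatmap and measures the difference to a predicted heatmap. The text does not name a rendering or loss function; the Gaussian rendering, the mean squared error, the value of `sigma`, and the function names are assumptions made for this example only.

```python
import math
from typing import List, Sequence, Tuple

Heatmap = List[List[float]]


def gaussian_heatmap(height: int, width: int,
                     centers: Sequence[Tuple[int, int]], sigma: float = 2.0) -> Heatmap:
    """Render annotated local cues given as (row, col) into a heatmap with values in [0, 1]."""
    heatmap = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for cr, cc in centers:
                d2 = (r - cr) ** 2 + (c - cc) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                heatmap[r][c] = max(heatmap[r][c], value)
    return heatmap


def mse_loss(predicted: Heatmap, truth: Heatmap) -> float:
    """Mean squared difference between two heatmaps of identical size."""
    total, n = 0.0, 0
    for row_p, row_t in zip(predicted, truth):
        for p, t in zip(row_p, row_t):
            total += (p - t) ** 2
            n += 1
    return total / n


# One annotated cue (for example a wheel edge) near the bottom of an 8 x 8 image.
truth = gaussian_heatmap(8, 8, [(6, 3)])
blank = [[0.0] * 8 for _ in range(8)]
print(mse_loss(blank, truth))
```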

[0038] Following the aforementioned urban expressway scenario, in the image data, the convolutional network automatically focuses on the ground contact area at the bottom of the truck through the attention module. In this area, the network detects rapidly passing clusters of bright, arc-shaped pixels whose shape closely resembles the edge of a bicycle's front wheel. Simultaneously, the attention module captures the momentary exposure of the top of a helmet in the right-side rear edge area of the truck. A salient region heatmap is generated from these features, in which the bottom area of the truck exhibits a high response value, and this heatmap is output as the visual feature map. This visual feature map indicates local cue regions where a target vehicle may exist, providing spatial priors for subsequent radar feature extraction.

[0039] Further, the extraction of point cloud clustering features of the target vehicle from the radar data includes: Step b1: Cluster the radar data, and identify as candidate targets those point cloud clusters whose points lie within a first threshold distance of one another and whose number of points exceeds a second threshold.

[0040] Here, due to the small radar cross-section of small targets such as two-wheeled vehicles, their echo point clouds are usually few in number and easily mixed with surrounding clutter. Traditional clustering algorithms typically set a minimum point count threshold, treating point cloud clusters with fewer points as noise and filtering them out. Furthermore, they are prone to incorrectly merging sparse point clouds that are close together into large point cloud clusters of vehicles ahead, leading to the loss of two-wheeled vehicle targets. This application's embodiment uses clustering with both distance and point count thresholds, enabling the separation of sparse but spatially clustered point cloud clusters from noise and identifying them as candidate targets.

[0041] In this embodiment, a clustering algorithm sensitive to low-density point clouds, such as the DBSCAN algorithm, is used to cluster the radar point cloud. The first threshold corresponds to the neighborhood distance parameter eps in DBSCAN, which is calibrated based on the physical size of the target vehicle and the radar point cloud resolution. The second threshold corresponds to the minimum core point count parameter MinPts in DBSCAN, which is set to a low value to retain candidate targets with fewer points. Through the above clustering process, point cloud clusters whose distances to each other are less than the first threshold and whose point count is greater than the second threshold are identified as candidate targets. The DBSCAN algorithm can discover clusters of arbitrary shapes, is insensitive to noise points, and does not require pre-specifying the number of clusters, making it particularly suitable for identifying sparse point clouds of small targets such as two-wheeled vehicles.
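A minimal, standard-library sketch of step b1 follows. It implements plain DBSCAN on 2D radar points and then keeps clusters with more points than the second threshold. The function names and the sample coordinates are hypothetical; the values 0.5 m and 3 points are the ones used in the expressway example of this application.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
NOISE = -1


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point; NOISE (-1) marks unclustered points."""
    def neighbors(i: int) -> List[int]:
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # only core points expand the cluster
                queue.extend(j_neighbors)
    return [NOISE if label is None else label for label in labels]


def candidate_clusters(points: Sequence[Point], first_threshold: float = 0.5,
                       second_threshold: int = 3) -> List[List[Point]]:
    """Clusters whose point count is greater than the second threshold."""
    labels = dbscan(points, eps=first_threshold, min_pts=second_threshold)
    clusters: Dict[int, List[Point]] = {}
    for point, label in zip(points, labels):
        if label != NOISE:
            clusters.setdefault(label, []).append(point)
    return [c for c in clusters.values() if len(c) > second_threshold]


# Four closely spaced returns plus two isolated clutter points (illustrative values).
points = [(20.0, 0.0), (20.3, 0.1), (20.1, -0.2), (20.4, -0.1), (25.0, 3.0), (12.0, -4.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
candidates = candidate_clusters(points)
print(labels, candidates)
```

Note that this neighbor search is quadratic in the number of points, which is acceptable for a sparse gated point cloud but would normally be replaced by a spatial index for larger inputs.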

[0042] Furthermore, before clustering, the radar data is first filtered to remove background and static clutter, and special attention is paid to the dynamic point cloud within a preset distance range behind the rear outline of the vehicle in front, in order to focus on the area where the target vehicle is most likely to be hidden.

[0043] Step b2: Associate point cloud clusters belonging to the same candidate target across multiple consecutive frames to obtain associated trajectories.

[0044] Here, candidate target detection in a single frame may suffer from missed detections or false alarms. By associating point cloud clusters belonging to the same target in multiple consecutive frames, a temporally continuous trajectory can be formed, improving the credibility of the target's existence.

[0045] In this embodiment, a trajectory association algorithm is used to match candidate targets across multiple consecutive frames. Specifically, based on the position and motion state of a candidate target in the previous frame, its possible position in the current frame is predicted, and candidate targets in the current frame are searched for near the predicted position. If a matching candidate target is found, it is associated with the target in the previous frame to form an associated trajectory. Through association across multiple consecutive frames, false alarm targets that appear occasionally can be filtered out, while persistent real targets are retained.
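The association described above can be sketched as a predict-then-gate loop. This is an illustrative simplification: it assumes a known constant relative velocity, a nearest-neighbor match inside a circular gate, and coasting on the prediction when a frame has no match. The gate radius, the frame interval, the minimum hit count, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def predict(position: Point, velocity: Point, dt: float) -> Point:
    """Constant-velocity prediction of the next position."""
    return (position[0] + velocity[0] * dt, position[1] + velocity[1] * dt)


def nearest_in_gate(predicted: Point, candidates: Sequence[Point], gate_m: float) -> Optional[int]:
    """Index of the nearest candidate within the gate, or None if nothing matches."""
    best_index, best_dist = None, gate_m
    for i, candidate in enumerate(candidates):
        d = math.dist(predicted, candidate)
        if d <= best_dist:
            best_index, best_dist = i, d
    return best_index


def build_trajectory(frames: Sequence[Sequence[Point]], start: Point, velocity: Point,
                     dt: float = 0.05, gate_m: float = 0.5,
                     min_hits: int = 3) -> Optional[List[Point]]:
    """Associate one candidate per frame; reject tracks with too few matched frames."""
    trajectory = [start]
    position = start
    for candidates in frames:
        predicted = predict(position, velocity, dt)
        match = nearest_in_gate(predicted, candidates, gate_m)
        if match is None:
            position = predicted  # coast through a missed detection
        else:
            position = candidates[match]
            trajectory.append(position)
    return trajectory if len(trajectory) >= min_hits else None


# Ego-frame positions; the second frame contains only an unrelated clutter point.
frames = [[(19.97, 0.02), (25.0, 3.0)], [(30.0, -2.0)], [(19.90, 0.0)]]
trajectory = build_trajectory(frames, start=(20.0, 0.0), velocity=(-0.8, 0.0))
print(trajectory)
```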

[0046] Step b3: Filter the associated trajectory to obtain the point cloud clustering features.

[0047] Here, due to measurement noise inherent in the radar point cloud, the correlated trajectory may exhibit jitter or discontinuity. Filtering can smooth the trajectory, yielding a more accurate target motion state.

[0048] In this embodiment, a filtering algorithm, such as Kalman filtering or particle filtering, is used to smooth the associated trajectory. Based on the target's historical motion state and current measurements, the filtering algorithm optimally estimates the target's position, velocity, and other states, outputting the smoothed trajectory as a point cloud clustering feature. This feature reflects the target's motion trajectory over continuous time.
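As one possible realization of the Kalman filtering option, the sketch below runs a one-dimensional constant-velocity Kalman filter over position measurements and returns position and velocity estimates per frame. The measurement variance, the acceleration variance, the initial covariance, and the function name are assumptions chosen for the example; a real system would filter both axes and tune these values to the radar.

```python
from typing import List, Sequence, Tuple


def kalman_filter_cv(measurements: Sequence[float], dt: float,
                     meas_var: float = 0.04, accel_var: float = 1.0) -> List[Tuple[float, float]]:
    """Constant-velocity Kalman filter; returns (position, velocity) per measurement."""
    pos, vel = measurements[0], 0.0
    # Initial covariance: position known to measurement accuracy, velocity unknown.
    p00, p01, p10, p11 = meas_var, 0.0, 0.0, 100.0
    # Discrete white-noise acceleration model for the process noise Q.
    q00 = 0.25 * dt ** 4 * accel_var
    q01 = 0.5 * dt ** 3 * accel_var
    q11 = dt ** 2 * accel_var
    estimates = [(pos, vel)]
    for z in measurements[1:]:
        # Predict: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        pos += vel * dt
        n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q00
        n01 = p01 + dt * p11 + q01
        n10 = p10 + dt * p11 + q01
        n11 = p11 + q11
        # Update with a position-only measurement, H = [1, 0].
        s = n00 + meas_var
        k0, k1 = n00 / s, n10 / s
        residual = z - pos
        pos += k0 * residual
        vel += k1 * residual
        p00, p01 = (1.0 - k0) * n00, (1.0 - k0) * n01
        p10, p11 = n10 - k1 * n00, n11 - k1 * n01
        estimates.append((pos, vel))
    return estimates


# Noise-free positions of a target moving at 2 m/s, sampled every 0.1 s.
measurements = [10.0 + 2.0 * 0.1 * k for k in range(50)]
final_pos, final_vel = kalman_filter_cv(measurements, dt=0.1)[-1]
print(final_pos, final_vel)
```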

[0049] Furthermore, after obtaining the associated trajectory, a motion consistency test is performed, including determining whether the motion smoothness of the trajectory conforms to vehicle dynamics, whether the relative positional relationship between the trajectory and the vehicle in front is reasonable, and whether the continuity of the trajectory meets the requirement of existing in multiple consecutive frames. Only trajectories that pass the test are used as the final point cloud clustering feature output.
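The three checks named above can be sketched as follows for longitudinal positions only. The acceleration limit, the maximum distance to the rear of the vehicle in front, the minimum frame count, and the function name are hypothetical thresholds for illustration; the text does not give concrete values.

```python
from typing import Sequence


def passes_motion_consistency(track_x: Sequence[float], lead_x: Sequence[float], dt: float,
                              max_accel: float = 8.0, max_gap_m: float = 5.0,
                              min_frames: int = 3) -> bool:
    """Check continuity, smoothness, and position relative to the vehicle in front."""
    # Continuity: the trajectory must exist in enough consecutive frames.
    if len(track_x) < min_frames or len(track_x) != len(lead_x):
        return False
    # Smoothness: finite-difference acceleration must stay within a plausible limit.
    velocities = [(b - a) / dt for a, b in zip(track_x, track_x[1:])]
    accels = [(b - a) / dt for a, b in zip(velocities, velocities[1:])]
    smooth = all(abs(a) <= max_accel for a in accels)
    # Relative position: the track must stay close to the rear of the vehicle in front.
    near_lead = all(abs(lead - own) <= max_gap_m for own, lead in zip(track_x, lead_x))
    return smooth and near_lead


lead = [21.5, 21.5, 21.5, 21.5]
steady_ok = passes_motion_consistency([20.00, 20.04, 20.08, 20.12], lead, dt=0.05)
jitter_ok = passes_motion_consistency([20.0, 20.5, 20.0, 20.6], lead, dt=0.05)
print(steady_ok, jitter_ok)
```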

[0050] Following the aforementioned urban expressway scenario, the DBSCAN algorithm was used to cluster point clouds in the radar data, with a first threshold of 0.5 meters and a second threshold of 3 points. A point cloud cluster consisting of 4 points was found 1.5 meters behind the truck's rear and was identified as a candidate target. In three consecutive frames, this point cloud cluster was successfully matched within the predicted location range, forming an associated trajectory. Kalman filtering was used to smooth the trajectory, resulting in stable point cloud clustering features. These features reflect the candidate target's trajectory over continuous time, with an estimated speed of approximately 77 km/h.

[0051] Furthermore, the extraction of the target vehicle's micro-Doppler features from the radar data includes: Step c1: Perform time-frequency transformation on the radar data to obtain a spectrogram.

[0052] Here, the echo signal received by the radar sensor includes not only the reflected signal from the target body, but also the micro-motion modulation signal generated by the moving parts of the target (such as wheel rotation or limb swing). These micro-motion signals are difficult to identify directly in the time domain. By converting them to the time-frequency domain through time-frequency transformation, the micro-motion characteristics can be clearly observed.

[0053] In this embodiment, the raw I/Q signal acquired by the radar sensor undergoes time-frequency transformation, for example, using short-time Fourier transform or wavelet transform, to convert the time-domain signal into a time-frequency domain signal, generating a spectrogram. The horizontal axis of this spectrogram represents time, the vertical axis represents frequency, and the pixel value represents signal energy. In the spectrogram, the target body is typically represented as a relatively stable, continuous horizontal color band, while the micro-moving parts are represented as a periodically changing curved pattern above and below the target body's color band.
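A minimal short-time Fourier transform over complex I/Q samples can be sketched with the standard library as below. The Hann window, the window length of 64, the hop of 32, the 1000 Hz sampling rate, and the 125 Hz test tone are assumptions for the example; the direct DFT is used for clarity, whereas a practical implementation would use an FFT.

```python
import cmath
import math
from typing import List, Sequence


def stft_power(signal: Sequence[complex], window: int, hop: int) -> List[List[float]]:
    """Return spectrogram[frame][bin] as the squared DFT magnitude of Hann-windowed segments."""
    hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (window - 1)) for n in range(window)]
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        segment = [signal[start + n] * hann[n] for n in range(window)]
        power = []
        for k in range(window):
            acc = sum(segment[n] * cmath.exp(-2j * math.pi * k * n / window)
                      for n in range(window))
            power.append(abs(acc) ** 2)
        frames.append(power)
    return frames


# A steady 125 Hz Doppler tone sampled at 1000 Hz stands in for the target body.
sample_rate_hz = 1000.0
signal = [cmath.exp(2j * math.pi * 125.0 * t / sample_rate_hz) for t in range(256)]
spec = stft_power(signal, window=64, hop=32)
peak_bins = [max(range(64), key=frame.__getitem__) for frame in spec]
print(len(spec), peak_bins)
```

Bin `k` corresponds to a frequency of `k * sample_rate_hz / window`, so the steady tone appears in the same bin in every frame, which is the stable horizontal band described above.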

[0054] Step c2: Extract the periodically changing curve pattern from the spectrogram as the micro-Doppler feature.

[0055] Here, the rotation of the wheels of a two-wheeled vehicle or the swaying of the rider's limbs produces a periodic micro-Doppler effect, which appears as a periodically changing curved pattern in the spectrogram. By extracting these patterns, it is possible to determine whether the target has the characteristics of moving parts, thereby aiding in the identification of the target type.

[0056] In this embodiment, image processing is performed on the generated spectrogram to identify and extract periodically changing curve patterns. Specifically, it can detect whether there is a curve in the spectrogram that swings up and down at a fixed frequency, the frequency of which is related to the wheel rotation speed or the limb swinging frequency. If the aforementioned periodically changing curve pattern is extracted, it is used as a micro-Doppler feature to characterize the presence of moving parts in the target, such as rotating wheels or swinging limbs. In the spectrogram, the main body of the vehicle in front usually appears as a relatively stable, continuous horizontal color band, while the obscured two-wheeled vehicle, due to wheel rotation or limb swinging, will form strong and periodically flashing curve patterns on both sides of this horizontal color band.
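One simple way to test for a curve that swings at a fixed frequency is to take the per-frame frequency offset of the curve (a ridge track extracted from the spectrogram) and look for a strong autocorrelation peak. The sketch below assumes such a ridge track is already available; the autocorrelation approach, the correlation threshold, the spectrogram frame rate of 100 Hz, and the function name are assumptions, not details given in the text.

```python
import math
from typing import Optional, Sequence


def dominant_period(track: Sequence[float], min_corr: float = 0.5) -> Optional[int]:
    """Lag in frames of the strongest autocorrelation peak, or None if no periodicity."""
    n = len(track)
    mean = sum(track) / n
    centered = [x - mean for x in track]
    energy = sum(x * x for x in centered)
    if energy == 0.0:
        return None  # a flat track carries no micro-motion
    corr = [sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy
            for lag in range(n // 2)]
    lag = 1
    while lag < len(corr) and corr[lag] > 0.0:
        lag += 1  # skip the main lobe around lag 0
    if lag == len(corr):
        return None
    best = max(range(lag, len(corr)), key=corr.__getitem__)
    return best if corr[best] >= min_corr else None


# Ridge track oscillating at 5 Hz, observed at an assumed 100 spectrogram frames per second.
frame_rate_hz = 100.0
ridge = [math.sin(2.0 * math.pi * 5.0 * k / frame_rate_hz) for k in range(200)]
period = dominant_period(ridge)
frequency_hz = frame_rate_hz / period if period else None
print(period, frequency_hz)
```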

[0057] Following the aforementioned urban expressway scenario, a short-time Fourier transform is performed on the radar signals in the candidate target area from the raw I/Q signals acquired by the radar sensors to obtain a spectrogram. In this spectrogram, the truck in front appears as a stable horizontal color band, while below it, a periodically oscillating curved pattern appears. The frequency of this pattern is approximately 5 Hz, consistent with the typical frequency of an electric bicycle wheel rotation. This periodic curved pattern is extracted as a micro-Doppler feature to determine the presence of a moving target with rotating wheels in the area.

[0058] Further, the feature fusion of the visual feature map, the point cloud clustering features, and the micro-Doppler features includes: Step d1: Use the local cue regions in the visual feature map as spatial constraints to limit the extraction range of the point cloud clustering features and the micro-Doppler features.

[0059] Here, the visual feature map indicates local clue regions in the image where target vehicles may exist; these regions correspond to specific spatial locations. Using this spatial location information as a constraint can guide radar feature extraction to focus on these suspicious regions, avoiding invalid calculations in irrelevant areas and reducing the false alarm rate.

[0060] In this embodiment, the high-response regions of the salient region heatmap in the visual feature map are mapped onto the radar coordinate system to obtain a spatially constrained region. In subsequent point cloud clustering and micro-Doppler feature extraction processes, only the radar point cloud and radar signal within this spatially constrained region are processed, thereby limiting the scope of radar feature extraction to the suspicious regions indicated by visual cues, achieving spatial prior guidance of radar based on visual perception.
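The mapping from heatmap to radar coordinates depends on the camera model and the camera-to-radar calibration, neither of which is specified here. The sketch below therefore makes strong simplifying assumptions: a pinhole camera with a known horizontal field of view, a camera and radar that share the same origin and boresight, radar points given as (range in meters, azimuth in degrees), and a single gate spanning all high-response columns. The response threshold, the angular margin, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

RadarPoint = Tuple[float, float]  # (range_m, azimuth_deg), positive azimuth to the right


def azimuth_gate(heatmap: Sequence[Sequence[float]], hfov_deg: float = 60.0,
                 response_threshold: float = 0.5,
                 margin_deg: float = 1.0) -> Optional[Tuple[float, float]]:
    """Azimuth interval in degrees covered by heatmap columns with a high response."""
    width = len(heatmap[0])
    hot_cols = [c for c in range(width)
                if any(row[c] >= response_threshold for row in heatmap)]
    if not hot_cols:
        return None
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg / 2.0))

    def col_to_azimuth(col: int) -> float:
        return math.degrees(math.atan((col + 0.5 - width / 2.0) / focal_px))

    return (col_to_azimuth(hot_cols[0]) - margin_deg,
            col_to_azimuth(hot_cols[-1]) + margin_deg)


def gate_points(points: Sequence[RadarPoint], gate: Tuple[float, float]) -> List[RadarPoint]:
    """Keep only the radar points whose azimuth lies inside the gate."""
    lo, hi = gate
    return [p for p in points if lo <= p[1] <= hi]


# A 4 x 8 heatmap with a high response in the two center columns of the bottom row.
heatmap = [[0.0] * 8 for _ in range(4)]
heatmap[3][3] = heatmap[3][4] = 0.9
gate = azimuth_gate(heatmap)
kept = gate_points([(21.5, 0.5), (21.6, -1.0), (15.0, 20.0)], gate)
print(gate, kept)
```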

[0061] Step d2: Using the point cloud clustering features and the micro-Doppler features as motion verification, determine whether the local cue regions in the visual feature map belong to moving targets.

[0062] Here, the local cue regions in the visual feature map may originate from real target vehicles, or from interfering factors such as changes in light and shadow, reflections, etc. By introducing radar features as motion verification, the accurate speed measurement capability of radar can be used to determine whether there are real moving targets in the region, thereby filtering out false visual cues.

[0063] In this embodiment, the point cloud clustering features and micro-Doppler features extracted in step d1 are associated with the local cue regions in the visual feature map. If point cloud clustering features are successfully extracted within the spatially constrained region, and these features indicate the presence of a target with a continuous motion trajectory; or if micro-Doppler features are successfully extracted, and these features indicate the presence of a periodically moving component, then the local cue region in the visual feature map is determined to be a real moving target. Conversely, if no valid radar features are extracted within the region, the visual cue is determined to be a false alarm and is filtered out.

[0064] Following the aforementioned urban expressway scenario, the ground contact area at the bottom of the truck in the visual feature map exhibits a high response value and is mapped to a spatially constrained region. Radar feature extraction is performed within this region, identifying a point cloud cluster composed of four radar points and extracting a periodically changing micro-Doppler curve pattern. The point cloud clustering features indicate that the target has a continuous motion trajectory, and the micro-Doppler features show periodic signals from rotating wheels, which verifies that the local cue region in the visual feature map belongs to a real moving target, namely an electric bicycle. The right edge region of the truck's rear also exhibits a high response value in the visual feature map, but no radar features are extracted at the corresponding spatial location, so this cue is judged to be a false alarm and is filtered out. Through this fusion, visual and radar detection verify each other, improving the accuracy of target detection.
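Steps d1 and d2 can be illustrated with a minimal Python sketch. It assumes that the high-response regions have already been mapped into the radar coordinate system as axis-aligned rectangles, and that radar features carry a position and a kind label. The class names, the labels "cluster_track" and "micro_doppler", and all coordinates are hypothetical and serve only to mirror the example above.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Region:
    """Spatially constrained region in radar coordinates (metres, assumed)."""
    name: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

@dataclass
class RadarFeature:
    kind: str  # "cluster_track" or "micro_doppler" (hypothetical labels)
    x: float
    y: float

def verify_cues(regions: List[Region], features: List[RadarFeature]) -> Dict[str, bool]:
    """Return True for regions confirmed by radar, False for false alarms."""
    verdicts = {}
    for region in regions:
        # Step d1: only radar features inside the region are considered.
        inside = [f for f in features if region.contains(f.x, f.y)]
        # Step d2: either kind of radar feature confirms a moving target.
        verdicts[region.name] = any(
            f.kind in ("cluster_track", "micro_doppler") for f in inside
        )
    return verdicts

regions = [
    Region("truck_underside", 18.0, 22.0, -1.0, 1.0),
    Region("truck_right_edge", 18.0, 22.0, 1.5, 2.5),
]
features = [
    RadarFeature("cluster_track", 20.1, 0.2),
    RadarFeature("micro_doppler", 20.0, 0.1),
]
verdicts = verify_cues(regions, features)
```

In this sketch the underside region is confirmed and the right edge region is rejected, matching the outcome described for the expressway scenario.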

[0065] Further, the step of inputting the fused feature vectors from multiple consecutive frames into the trained sequence model and outputting the motion state of the target vehicle includes: Step e1: Arrange the fused feature vectors of multiple consecutive frames in temporal order to form a temporal sequence.

[0066] Here, the fused feature vector of a single frame only reflects the target features at the current moment and cannot reflect the target's motion patterns over time. By arranging the fused feature vectors of multiple consecutive frames in chronological order, the target's motion history information can be constructed, laying the foundation for subsequent temporal analysis.

[0067] In this embodiment, the fused feature vectors of the current frame and the preceding N-1 consecutive frames are arranged in order of acquisition time to form a time sequence of length N. Each element in this time sequence is the fused feature vector of the corresponding frame, containing the integrated information obtained by fusing the visual and radar features of that frame. By constructing the time sequence, the target detection problem is transformed into a time sequence analysis problem, which helps the sequence model learn the motion patterns of the target.
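One possible way to maintain such a time sequence at runtime is a fixed-length sliding window, sketched below in Python. The class name `FusedFeatureWindow` and the two-element feature vectors are assumptions for illustration only.

```python
from collections import deque
from typing import List, Optional, Sequence

class FusedFeatureWindow:
    """Keeps the most recent `length` fused feature vectors in time order."""

    def __init__(self, length: int) -> None:
        self._frames = deque(maxlen=length)

    def push(self, fused_vector: Sequence[float]) -> None:
        # The oldest frame is dropped automatically once the window is full.
        self._frames.append(list(fused_vector))

    def sequence(self) -> Optional[List[List[float]]]:
        # Return None until enough frames have been collected.
        if len(self._frames) < self._frames.maxlen:
            return None
        return list(self._frames)

window = FusedFeatureWindow(length=5)
for frame_index in range(1, 7):
    window.push([float(frame_index), 0.0])
latest = window.sequence()
```

After six frames have been pushed into a window of length 5, the sequence holds frames 2 to 6 in chronological order.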

[0068] Step e2: Input the time sequence into the trained sequence model, and the sequence model outputs the position and speed of the target vehicle in the current frame.

[0069] Here, the sequence model learns motion patterns from historical time sequences, enabling it to infer the target state at the current moment from feature changes across multiple consecutive frames, thus achieving accurate estimation of the target's motion state. The predicted position and velocity output by the sequence model can be used for target association in the next frame, prioritizing the search for candidate targets near the predicted position in the next frame, thereby improving matching efficiency and tracking stability.

[0070] In this embodiment, the time sequence constructed in step e1 is input into a trained sequence model, which can employ an LSTM network or a Transformer network. The sequence model captures long-range dependencies in the time sequence through internal gating or attention mechanisms, learns the acceleration and velocity changes of the target motion, and finally outputs the position and velocity of the target vehicle in the current frame as a motion state estimation result.

[0071] Step e3: When the fused feature vector of a certain frame is missing, the sequence model outputs the predicted position and predicted speed of the target vehicle in the current frame based on the historical time sequence.

[0072] In real-world road scenarios, due to occlusion or sensor noise, some frames may not yield effective fusion feature vectors, resulting in incomplete input sequences. By leveraging the predictive capabilities of sequence models, the target state at the current moment can be inferred from historical motion trends even with missing features, maintaining tracking continuity.

[0073] In this embodiment, when a valid fused feature vector cannot be generated for a particular frame, the sequence model does not rely on the input of that frame. Instead, it predicts the target state of that frame based on the motion patterns in the previously input historical time-series sequences, combined with the learned motion laws. Specifically, the sequence model extrapolates the target position and velocity of the current frame based on the position and velocity change trends of the previous few frames, and outputs the predicted position and velocity. This mechanism enables the system to maintain a continuous estimation of the target state even in frames where the target is completely occluded and the sensor cannot capture valid features.
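The extrapolation idea can be illustrated with a simple stand-in. The Python sketch below is not the LSTM or Transformer sequence model of this embodiment; it replaces the learned model with a least-squares straight-line fit over the recent position history, purely to show how a position and velocity for a frame without input can be derived from the trend of earlier frames. The function name, the uniform frame interval `dt`, and the sample values are assumptions.

```python
from typing import List, Tuple

def extrapolate_next(
    history: List[Tuple[float, float]], dt: float
) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Predict (position, velocity) for the frame following `history`.

    `history` holds (x, y) positions at a uniform spacing of `dt` seconds
    and must contain at least two entries.
    """
    n = len(history)
    times = [i * dt for i in range(n)]
    t_mean = sum(times) / n
    denom = sum((t - t_mean) ** 2 for t in times)
    t_next = n * dt
    position, velocity = [], []
    for axis in range(2):
        values = [p[axis] for p in history]
        v_mean = sum(values) / n
        # Least-squares slope of position over time is the velocity estimate.
        slope = sum(
            (t - t_mean) * (v - v_mean) for t, v in zip(times, values)
        ) / denom
        velocity.append(slope)
        position.append(v_mean + slope * (t_next - t_mean))
    return (position[0], position[1]), (velocity[0], velocity[1])

# Five observed frames, then frame 6 has no fused feature vector.
history = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.5), (4.0, 2.0)]
predicted_position, predicted_velocity = extrapolate_next(history, dt=0.1)
```

A trained sequence model would be expected to capture acceleration and more complex motion patterns that this linear stand-in cannot represent.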

[0074] The predicted position and predicted velocity output by the sequence model are used for data association of the target in the next frame. That is, in the next frame, candidate targets are searched for near the predicted position first. When the target reappears from the occlusion, seamless tracking can be achieved.
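A minimal sketch of this association step, assuming a gated nearest-neighbour rule, is given below in Python. The gate radius and the function name are hypothetical; this application does not specify the matching rule beyond searching near the predicted position first.

```python
import math
from typing import List, Optional, Tuple

def associate(
    predicted: Tuple[float, float],
    candidates: List[Tuple[float, float]],
    gate_radius: float,
) -> Optional[int]:
    """Return the index of the candidate closest to the predicted position.

    Candidates farther away than `gate_radius` are ignored; None is returned
    when no candidate lies inside the gate.
    """
    best_index, best_distance = None, gate_radius
    for index, candidate in enumerate(candidates):
        distance = math.dist(predicted, candidate)
        if distance <= best_distance:
            best_index, best_distance = index, distance
    return best_index

candidates = [(9.0, 4.0), (5.2, 2.4), (5.9, 2.5)]
matched = associate((5.0, 2.5), candidates, gate_radius=1.0)
```

Here the second candidate is selected because it is the closest one inside the gate around the predicted position.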

[0075] Following the aforementioned urban expressway scenario, the fused feature vectors from five consecutive frames are arranged temporally to form a temporal sequence. Each frame's fused feature vector contains fused information from visual cues of the truck's underside region and radar point cloud features. This temporal sequence is input into an LSTM network, which outputs the position and speed of the electric bicycle in the current frame. In frame 6, because the electric bicycle is completely occluded, effective radar point cloud and visual features cannot be extracted, resulting in a missing fused feature vector for this frame. Based on the historical motion sequence of the previous five frames, the LSTM network predicts the position and speed of the electric bicycle in frame 6, outputting the predicted position and speed, maintaining tracking continuity. When the target is partially revealed again in frame 7, the system successfully matches a new fused feature vector near the predicted position, achieving seamless tracking continuity.

[0076] Further, obtaining at least one supplementary piece of evidence and calculating the probability of the target vehicle's existence based on the at least one supplementary piece of evidence includes: Step f1: Obtain the speed of the vehicle ahead and the speed of the weak radar target corresponding to the point cloud clustering features, and calculate the speed difference between the two as evidence of speed difference.

[0077] Here, if the obscured target vehicle is moving independently, its speed should differ from that of the vehicle in front. By analyzing the speed difference between the two, we can verify whether the target exists independently from the perspective of its motion state. If the speed difference is significant, it indicates that the weak target is an independent moving object.

[0078] In this embodiment, the stable tracking speed of the vehicle ahead is obtained from the perception system, and the estimated speed of the weak radar target corresponding to the point cloud clustering features is obtained simultaneously. The speed difference between the two is calculated. If the absolute value of the speed difference exceeds a preset threshold, the speed difference is used as evidence of speed difference, and the score of this evidence is positively correlated with the magnitude of the speed difference; if the speed difference is small, the score of the speed difference evidence is low. In addition to the speed difference, the acceleration difference between the vehicle ahead and the weak radar target can also be calculated. If the acceleration difference is significant, it is also included as part of the speed difference evidence, enhancing the credibility of the evidence.
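The scoring rule can be sketched as follows in Python. This application only states that the score is positively correlated with the speed difference once a preset threshold is exceeded; the linear ramp, the zero score below the threshold, and the saturation value used here are assumptions, so the sketch does not reproduce any specific score quoted elsewhere in this description.

```python
def speed_difference_score(
    lead_speed_kmh: float,
    target_speed_kmh: float,
    threshold_kmh: float = 1.0,
    saturation_kmh: float = 5.0,
) -> float:
    """Map the absolute speed difference to an evidence score in [0, 1].

    Differences at or below the threshold score 0 (assumed); above it the
    score rises linearly and saturates at 1 (assumed mapping).
    """
    difference = abs(target_speed_kmh - lead_speed_kmh)
    if difference <= threshold_kmh:
        return 0.0
    ramp = (difference - threshold_kmh) / (saturation_kmh - threshold_kmh)
    return min(1.0, ramp)

small = speed_difference_score(75.0, 75.5)
medium = speed_difference_score(75.0, 77.0)
large = speed_difference_score(75.0, 90.0)
```

Any monotonically increasing mapping could be substituted for the linear ramp without changing the role of this evidence.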

[0079] Step f2: Obtain the lidar point cloud, perform ground segmentation on the lidar point cloud to obtain the free space region, and determine whether the point cloud clustering feature or the micro-Doppler feature exists in the free space region as free space evidence.

[0080] Here, lidar can provide high-precision three-dimensional spatial information, and ground segmentation can distinguish between drivable areas and obstacle areas. If a radar signature is detected in the free space area behind the vehicle in front, it indicates the presence of a small target that has not been visually detected in that area, further confirming the target's existence.

[0081] In this embodiment, lidar point cloud data is acquired, and a ground segmentation algorithm is used to divide the point cloud into ground points and non-ground points to obtain a free space region. Then, it is determined whether the point cloud clustering features extracted in step b1 or the micro-Doppler features extracted in step c2 exist in the free space region behind the rear of the vehicle in front. If the aforementioned radar features exist, the presence of an abnormal signal in that region is taken as free space evidence, and the score of this evidence is positively correlated with the credibility of the features.
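A simplified illustration of this check is given below in Python. It assumes a height-threshold ground segmentation on a two-dimensional grid, which is a deliberately crude substitute for the unspecified ground segmentation algorithm; the cell size, the height threshold, the function names, and all coordinates are hypothetical.

```python
import math
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]

def cell_of(x: float, y: float, cell_size: float) -> Cell:
    return (math.floor(x / cell_size), math.floor(y / cell_size))

def free_space_cells(
    lidar_points: Iterable[Tuple[float, float, float]],
    cell_size: float = 0.5,
    ground_height: float = 0.2,
) -> Set[Cell]:
    """Cells that contain ground returns and no returns above ground height."""
    ground, obstacle = set(), set()
    for x, y, z in lidar_points:
        target = obstacle if z > ground_height else ground
        target.add(cell_of(x, y, cell_size))
    return ground - obstacle

def is_in_free_space(
    feature_xy: Tuple[float, float], free_cells: Set[Cell], cell_size: float = 0.5
) -> bool:
    """True if a radar feature position falls inside a free-space cell."""
    return cell_of(feature_xy[0], feature_xy[1], cell_size) in free_cells

lidar_points = [
    (19.0, 0.0, 0.02), (19.2, 0.1, 0.03),   # road surface behind the truck
    (21.0, 0.0, 1.50), (21.1, 0.1, 0.00),   # truck rear and road beneath it
]
free_cells = free_space_cells(lidar_points)
evidence_found = is_in_free_space((19.1, 0.2), free_cells)
```

A radar feature located in a cell that lidar reports as free suggests a small target that lidar and vision did not resolve, which is the situation this evidence is meant to capture.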

[0082] Step f3: Obtain external communication information and determine whether there is target vehicle information in the external communication information that matches the target vehicle's location, as external communication evidence.

[0083] Here, through vehicle-to-everything (V2X) communication technologies, such as vehicle-to-infrastructure (V2I) communication, vehicles can obtain traffic participant information sent by roadside units or other vehicles. If the external communication information reports a target that matches the area inferred by the vehicle, it can serve as strong external verification evidence.

[0084] In this embodiment, external communication information is acquired through an onboard communication unit. This information may originate from roadside units, vehicles ahead, or other traffic participants. The target location, type, and other information contained in the external communication information are parsed to determine whether there is two-wheeled vehicle information matching the target vehicle location determined in step e2 or step e3. If matching information exists, this information is used as external communication evidence, and the score of this evidence is positively correlated with the confidence level of the information.
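The matching step can be sketched in Python as follows. The `ExternalReport` structure is a hypothetical simplification and does not correspond to any standardized V2X message format; the type label, the distance limit, and the use of the report's confidence as the evidence score are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExternalReport:
    """Hypothetical, simplified traffic participant report."""
    x: float
    y: float
    kind: str
    confidence: float  # in [0, 1]

def external_evidence_score(
    reports: List[ExternalReport],
    target_xy: Tuple[float, float],
    max_distance: float = 3.0,
    kind: str = "two_wheeler",
) -> float:
    """Highest confidence among reports matching the target type and location."""
    score = 0.0
    for report in reports:
        close = math.dist((report.x, report.y), target_xy) <= max_distance
        if report.kind == kind and close:
            score = max(score, report.confidence)
    return score

reports = [
    ExternalReport(20.5, 0.4, "two_wheeler", 0.7),
    ExternalReport(60.0, 3.5, "two_wheeler", 0.9),
    ExternalReport(20.0, 0.0, "truck", 0.95),
]
score = external_evidence_score(reports, target_xy=(20.0, 0.2))
```

Only the first report matches both the type and the inferred location, so its confidence becomes the evidence score in this sketch.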

[0085] Step f4: Dynamically adjust the weights of the velocity difference evidence, the free space evidence, and the external communication evidence according to the sensor configuration, and sum the weighted three pieces of evidence to obtain the existence probability.

[0086] Here, different sensors have varying degrees of reliability under different environmental conditions. By dynamically adjusting the weights of each piece of evidence, the probability calculation can be made more adaptable to the actual scenario, thereby improving the accuracy of the judgment.

[0087] In this embodiment, the weights of speed difference evidence, free space evidence, and external communication evidence are dynamically adjusted based on the vehicle's sensor configuration and environmental conditions. For example, if the vehicle is not equipped with a lidar, the weight of free space evidence is set to 0; if the vehicle is in an area without external communication signals, such as a tunnel, the weight of external communication evidence is set to 0; if the confidence level of speed difference evidence is low, its weight is reduced accordingly. The probability of the target vehicle's existence is obtained by multiplying the scores of the three pieces of evidence by their respective weights and then summing them. This probability ranges from 0 to 1, with a higher value indicating a greater likelihood that the target vehicle actually exists.

[0088] Following the aforementioned urban expressway scenario, the stable tracking speed of the truck ahead is 75 km/h, while the speed of the weak radar target is 77 km/h, giving a speed difference of 2 km/h, which exceeds the preset threshold of 1 km/h; the speed difference evidence score is 0.8. The vehicle is equipped with a lidar. After ground segmentation of the lidar point cloud, point cloud clustering features and micro-Doppler features are detected in the free space region within 2 meters behind the truck's rear, giving a free space evidence score of 0.9. At the same time, the vehicle receives information from a roadside unit via V2X indicating the presence of an electric bicycle on this road section, which matches the area inferred by the vehicle; the external communication evidence score is 0.7. Based on the current sensor configuration, the vehicle sets the weight of each of the three pieces of evidence to 1, so the weighted sum is 0.8 + 0.9 + 0.7 = 2.4. After normalization by the total weight of 3, the final existence probability is 0.8, indicating a high probability that an electric bicycle is present in this area.
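The arithmetic of this example can be reproduced with a short Python sketch. Normalizing by the sum of the weights is an assumption inferred from the figures in the example (2.4 divided by 3 gives 0.8); the function name and the dictionary keys are hypothetical.

```python
from typing import Dict

def existence_probability(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of evidence scores, normalized by the total weight."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0  # no usable evidence source (assumed behaviour)
    weighted_sum = sum(scores[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight

scores = {"speed_difference": 0.8, "free_space": 0.9, "external_comm": 0.7}

# All three evidence sources available, each with weight 1.
probability = existence_probability(
    scores, {"speed_difference": 1.0, "free_space": 1.0, "external_comm": 1.0}
)

# Same scores on a vehicle without lidar: the free space weight is set to 0.
probability_no_lidar = existence_probability(
    scores, {"speed_difference": 1.0, "free_space": 0.0, "external_comm": 1.0}
)
```

Dividing by the total weight keeps the result between 0 and 1 whenever each score lies in that range, including when a weight has been set to 0.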

[0089] Further, the sequence model is trained according to the following steps: Step g1: Obtain historical time-series samples; each sample includes fused feature vectors from multiple consecutive frames and their corresponding annotations of the target vehicle's true motion state.

[0090] Here, sequence models need to learn motion patterns from a large amount of historical data in order to accurately predict the motion state of a target. Therefore, it is necessary to construct a training sample set that includes input features and output labels.

[0091] In this embodiment, a large number of historical time-series samples are extracted from real road data or simulation data. Each sample consists of the fused feature vectors of N consecutive frames, which serve as the input to the model. The actual motion state of the target vehicle in each frame, including its actual position and speed, is also labeled and serves as the model's supervision signal. The samples cover various typical scenarios, including two-wheeled vehicles closely following the vehicle in front, reappearing after being occluded, and changing speed while following, as well as extreme conditions such as rapid acceleration and deceleration.
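The construction of such samples from a recorded track can be sketched in Python as below. Pairing each window of N frames with the labeled state of its last frame is an assumption made for illustration, as are the function name and the toy data.

```python
from typing import List, Sequence, Tuple

Features = Sequence[float]
State = Tuple[float, float, float]  # (x, y, speed), assumed layout

def make_samples(
    features: List[Features], states: List[State], n: int
) -> List[Tuple[List[Features], State]]:
    """Slide a window of n frames over a track and attach the last frame's label."""
    if len(features) != len(states):
        raise ValueError("features and states must have the same length")
    samples = []
    for end in range(n, len(features) + 1):
        window = features[end - n:end]
        samples.append((window, states[end - 1]))
    return samples

# Toy track of 12 frames with two-element feature vectors.
track_features = [[float(i), 0.0] for i in range(12)]
track_states = [(float(i), 0.0, 20.0) for i in range(12)]
samples = make_samples(track_features, track_states, n=10)
```

A track of 12 frames yields three overlapping samples of 10 frames each in this sketch.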

[0092] Step g2: Input the historical time series samples into the sequence model to be trained, and the sequence model outputs the predicted motion state.

[0093] Here, during the training process, the constructed training samples are input into the sequence model, and the model outputs the prediction result of the motion state of the current frame based on the input historical time sequence.

[0094] In this embodiment, the temporal sequence samples constructed in step g1 are input into the sequence model to be trained. The sequence model can employ an LSTM network or a Transformer network. The model calculates the predicted position and predicted velocity of the current frame based on the input temporal sequence through forward propagation, and outputs these as the predicted motion state.

[0095] Step g3: Adjust the parameters of the sequence model based on the difference between the predicted motion state and the actual motion state until the preset training stopping condition is met.

[0096] Here, the loss function is calculated by comparing the difference between the model's prediction results and the actual labels, and the model parameters are updated using the backpropagation algorithm, so that the model's prediction ability is gradually improved.

[0097] In this embodiment, mean squared error or smoothed L1 loss is used as the loss function to calculate the difference between the predicted motion state and the actual motion state. The gradient is calculated using the backpropagation algorithm, and the parameters of the sequence model are updated using an optimizer (such as Adam). Steps g2 and g3 are repeated to iteratively optimize the model parameters until the loss function converges or a preset number of training epochs is reached, resulting in a trained sequence model. During training, a validation set can also be introduced for early stopping to prevent overfitting.
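The two loss functions and a validation-based early stopping check can be written out in plain Python as a framework-independent sketch. The `beta` parameter of the smoothed L1 loss, the `patience` parameter, and the function names are assumptions; an actual implementation would normally rely on the loss and optimizer classes of a deep learning framework.

```python
from typing import List, Sequence

def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        error = abs(p - t)
        if error < beta:
            total += 0.5 * error ** 2 / beta
        else:
            total += error - 0.5 * beta
    return total / len(pred)

def should_stop(validation_losses: List[float], patience: int) -> bool:
    """Stop when the best validation loss is at least `patience` epochs old."""
    best_epoch = validation_losses.index(min(validation_losses))
    return len(validation_losses) - 1 - best_epoch >= patience

mse_value = mean_squared_error([1.0, 2.0], [1.0, 4.0])
l1_value = smooth_l1([0.0, 0.0], [0.5, 3.0])
stop_now = should_stop([1.0, 0.8, 0.9, 0.85, 0.82], patience=3)
```

The smoothed L1 loss is less sensitive than mean squared error to occasional large labeling errors, which is one reason it may be chosen for position and speed regression.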

[0098] Step g4: After deployment, continuously update the sequence model based on actual operational data.

[0099] Here, real-world road scenarios are complex and varied, and pre-trained models may not be able to cover all operating conditions. By continuously updating the model using real-world operational data after deployment, the model can adapt to new scenarios and maintain tracking accuracy.

[0100] In this embodiment, during the actual operation of the vehicle, high-confidence detection results are used as pseudo-labels to collect real-time data containing fused feature vectors and motion states. At preset time intervals, the newly collected data is used to incrementally train or fine-tune the sequence model, enabling the model to adapt to target motion patterns under different road conditions, weather conditions, and traffic flow conditions, continuously improving the model's generalization ability and tracking accuracy.
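The selection of pseudo-labels can be sketched as a simple confidence filter in Python. The confidence threshold of 0.9, the dictionary keys, and the function name are hypothetical; this application does not state how high-confidence results are identified.

```python
from typing import Dict, List, Tuple

def select_pseudo_labels(
    detections: List[Dict], min_confidence: float = 0.9
) -> List[Tuple[list, tuple]]:
    """Keep (fused feature sequence, motion state) pairs from confident detections."""
    return [
        (d["features"], d["state"])
        for d in detections
        if d["confidence"] >= min_confidence
    ]

detections = [
    {"features": [[0.1, 0.2]], "state": (20.0, 0.2, 21.4), "confidence": 0.95},
    {"features": [[0.3, 0.1]], "state": (35.0, 1.0, 19.0), "confidence": 0.60},
]
pseudo_labeled = select_pseudo_labels(detections)
```

The retained pairs would then be accumulated and used for incremental training or fine-tuning at the preset interval.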

[0101] Taking an autonomous driving test fleet as an example, 100,000 time-series samples are extracted from historical road test data. Each sample contains the fused feature vectors of 10 consecutive frames and the corresponding annotations of the two-wheeled vehicle's actual position and speed. The samples cover various scenarios such as urban expressways, urban roads, and tunnels, as well as various working conditions such as occlusion, lane changes, and emergency braking. The samples are input into an LSTM network for training, using the mean squared error loss function and the Adam optimizer with an initial learning rate of 0.001, and training runs for 50 epochs until the loss converges. After training, the model is deployed on real vehicles. In actual operation, the system collects high-confidence detection data every 24 hours to incrementally fine-tune the model, allowing it to gradually adapt to newly opened road sections and new driving behavior patterns of two-wheeled vehicles.

[0102] This application provides a target vehicle detection method, comprising: acquiring image data collected by a vehicle image sensor and radar data collected by a radar sensor; extracting visual features of the target vehicle from the image data to obtain a visual feature map, and extracting point cloud clustering features and micro-Doppler features of the target vehicle from the radar data; the visual feature map indicating local cue regions in the image associated with the target vehicle; fusing the visual feature map, point cloud clustering features, and micro-Doppler features to obtain a fused feature vector, and inputting the fused feature vector of multiple consecutive frames into a trained sequence model to output the motion state of the target vehicle; acquiring at least one supplementary evidence, and calculating the existence probability of the target vehicle based on at least one supplementary evidence; generating a virtual target of the target vehicle based on the existence probability and motion state; the virtual target includes the target vehicle's identifier, position, and speed. Thus, by fusing visual local cues and radar features, and combining a sequence model with multi-evidence reasoning, active detection and state estimation of occluded target vehicles are achieved, improving the accuracy of target detection for autonomous vehicles.
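The final step of the method can be illustrated with a minimal Python data structure. The virtual target fields (identifier, position, speed) follow the description above; gating the generation on a minimum existence probability of 0.5, the class name, and the function name are assumptions, since this application only states that the virtual target is generated based on the existence probability and the motion state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class VirtualTarget:
    identifier: str
    position: Tuple[float, float]  # metres in the vehicle frame (assumed)
    speed: float                   # km/h (assumed unit)

def generate_virtual_target(
    identifier: str,
    position: Tuple[float, float],
    speed: float,
    existence_probability: float,
    min_probability: float = 0.5,
) -> Optional[VirtualTarget]:
    """Create a virtual target only if the existence probability is high enough."""
    if existence_probability < min_probability:
        return None
    return VirtualTarget(identifier, position, speed)

target = generate_virtual_target("ebike-1", (20.0, 0.2), 77.0, existence_probability=0.8)
rejected = generate_virtual_target("ghost-1", (20.0, 2.0), 75.0, existence_probability=0.2)
```

Downstream planning modules could then treat the returned object like any other tracked obstacle.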

[0103] Based on the same application concept, this application also provides a target vehicle detection device corresponding to the target vehicle detection method provided in the above embodiments. Since the principle of the device in this application is similar to the target vehicle detection method in the above embodiments of this application, the implementation of the device can refer to the implementation of the method, and the repeated parts will not be described again.

[0104] Please refer to Figure 2, which is a first functional block diagram of a target vehicle detection device provided in an embodiment of this application. As shown in Figure 2, an embodiment of this application provides a target vehicle detection device 200 comprising: a data acquisition module 210, used to acquire image data collected by the vehicle's image sensor and radar data collected by the radar sensor.

[0105] The feature extraction module 220 is used to extract the visual features of the target vehicle from the image data to obtain a visual feature map, and to extract the point cloud clustering features and micro-Doppler features of the target vehicle from the radar data; the visual feature map indicates the local cue regions in the image that are associated with the target vehicle.

[0106] The state tracking module 230 is used to fuse the visual feature map, the point cloud clustering features and the micro-Doppler features to obtain a fused feature vector, and input the fused feature vector of multiple consecutive frames into the trained sequence model to output the motion state of the target vehicle.

[0107] The probability calculation module 240 is used to acquire at least one supplementary piece of evidence and calculate the probability of the existence of the target vehicle based on the at least one supplementary piece of evidence.

[0108] The target generation module 250 is used to generate a virtual target of the target vehicle based on the existence probability and the motion state; the virtual target includes the identifier, position and speed of the target vehicle.

[0109] Furthermore, when the feature extraction module 220 extracts the visual features of the target vehicle from the image data to obtain a visual feature map, the feature extraction module 220 is specifically used for: inputting the image data into a convolutional network, and extracting, based on the attention module in the convolutional network, features of the ground contact area at the bottom of the vehicle in front, the two side edge areas at the rear of the vehicle in front, and the projection and shadow areas on the ground behind the vehicle in front; and generating a salient region heatmap from the extracted features, the heatmap serving as the visual feature map; wherein the training supervision signal for the salient region heatmap is the local region annotation related to the target vehicle.

[0110] Furthermore, when extracting point cloud clustering features of the target vehicle from the radar data, the feature extraction module 220 is specifically used for: clustering the radar data, and identifying as candidate targets the point cloud clusters whose points are separated by less than a first threshold and whose number of points exceeds a second threshold; associating the point cloud clusters belonging to the same candidate target across multiple consecutive frames to obtain an associated trajectory; and filtering the associated trajectory to obtain the point cloud clustering features.

[0111] Furthermore, when extracting the micro-Doppler features of the target vehicle from the radar data, the feature extraction module 220 is specifically used for: performing a time-frequency transformation on the radar data to obtain a spectrogram; and extracting the periodically changing curve pattern from the spectrogram as the micro-Doppler feature.

[0112] Furthermore, when the state tracking module 230 performs feature fusion on the visual feature map, the point cloud clustering features, and the micro-Doppler features, the state tracking module 230 is specifically used for: using the local cue regions in the visual feature map as spatial constraints to limit the extraction range of the point cloud clustering features and the micro-Doppler features; and using the point cloud clustering features and the micro-Doppler features as motion verification to determine whether the local cue regions in the visual feature map belong to moving targets.

[0113] Furthermore, when the state tracking module 230 inputs the fused feature vectors from multiple consecutive frames into the trained sequence model and outputs the motion state of the target vehicle, the state tracking module 230 is specifically used for: arranging the fused feature vectors of multiple consecutive frames in temporal order to form a time sequence; inputting the time sequence into the trained sequence model, the sequence model outputting the position and speed of the target vehicle in the current frame; and, when the fused feature vector of a certain frame is missing, having the sequence model output the predicted position and predicted speed of the target vehicle in the current frame based on the historical time sequence.

[0114] Furthermore, when the probability calculation module 240 acquires at least one supplementary piece of evidence and calculates the probability of the existence of the target vehicle based on the at least one supplementary piece of evidence, the probability calculation module 240 is specifically used for: obtaining the speed of the vehicle ahead and the speed of the weak radar target corresponding to the point cloud clustering features, and calculating the speed difference between the two as speed difference evidence; acquiring a lidar point cloud, performing ground segmentation on the lidar point cloud to obtain a free space region, and determining whether the point cloud clustering features or the micro-Doppler features exist in the free space region as free space evidence; acquiring external communication information and determining whether the external communication information contains target vehicle information that matches the location of the target vehicle, as external communication evidence; and dynamically adjusting the weights of the speed difference evidence, the free space evidence, and the external communication evidence according to the sensor configuration, and summing the three weighted pieces of evidence to obtain the existence probability.

[0115] Further, please refer to Figure 3, which is a second functional block diagram of a target vehicle detection device provided in an embodiment of this application. As shown in Figure 3, the target vehicle detection device 200 also includes: a sample acquisition module 260, used to acquire historical time-series samples; each sample includes the fused feature vectors of multiple consecutive frames and the corresponding annotation of the target vehicle's real motion state.

[0116] The model training module 270 is used to input the historical time series samples into the sequence model to be trained, and the sequence model outputs the predicted motion state.

[0117] The parameter adjustment module 280 is used to adjust the parameters of the sequence model according to the difference between the predicted motion state and the actual motion state until a preset training stopping condition is met.

[0118] The model update module 290 is used to continuously update the sequence model based on actual operating data after deployment.

[0119] This application provides a target vehicle detection device, comprising: a data acquisition module for acquiring image data collected by a vehicle image sensor and radar data collected by a radar sensor; a feature extraction module for extracting visual features of the target vehicle from the image data to obtain a visual feature map, and extracting point cloud clustering features and micro-Doppler features of the target vehicle from the radar data; the visual feature map indicates local cue regions in the image associated with the target vehicle; a state tracking module for fusing the visual feature map, point cloud clustering features, and micro-Doppler features to obtain a fused feature vector, and inputting the fused feature vector of multiple consecutive frames into a trained sequence model to output the motion state of the target vehicle; a probability calculation module for acquiring at least one supplementary evidence and calculating the existence probability of the target vehicle based on at least one supplementary evidence; and a target generation module for generating a virtual target of the target vehicle based on the existence probability and motion state; the virtual target includes the target vehicle's identifier, position, and speed. Thus, by fusing visual local cues and radar features, and combining a sequence model with multi-evidence reasoning, active detection and state estimation of occluded target vehicles are achieved, improving the accuracy of target detection for autonomous vehicles.

[0120] Based on the same application concept, please refer to Figure 4, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 4, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.

[0121] The memory 420 stores machine-readable instructions executable by the processor 410. When the electronic device 400 is running, the processor 410 and the memory 420 communicate through the bus 430. When the machine-readable instructions are executed by the processor 410, the steps of the target vehicle detection method provided in the above embodiment are executed. For specific implementation methods, please refer to the method embodiment, which will not be repeated here.

[0122] Based on the same concept, this application also provides a computer-readable storage medium storing a computer program. When the computer program is run by a processor, it executes the steps of the target vehicle detection method provided in the above embodiments. For specific implementation details, please refer to the method embodiments, which will not be repeated here.

[0123] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the above-described apparatus and unit can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0124] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

[0125] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0126] In addition, the functional units in the embodiments provided in this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0127] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0128] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. In addition, the terms "first", "second", "third", etc. are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0129] Finally, it should be noted that the above-described embodiments are merely specific implementations of this application, used to illustrate its technical solutions rather than to limit them, and the protection scope of this application is not limited thereto. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in this application, anyone familiar with the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and all of them should be covered within the protection scope of this application. Therefore, the protection scope of this application should be determined by the protection scope of the claims.

Claims

1. A method for detecting a target vehicle, characterized in that the method includes: acquiring image data collected by the vehicle's image sensor and radar data collected by the radar sensor; extracting visual features of the target vehicle from the image data to obtain a visual feature map, and extracting point cloud clustering features and micro-Doppler features of the target vehicle from the radar data, wherein the visual feature map indicates local cue regions in the image that are associated with the target vehicle; fusing the visual feature map, the point cloud clustering features, and the micro-Doppler features to obtain a fused feature vector, and inputting the fused feature vectors of multiple consecutive frames into a trained sequence model to output the motion state of the target vehicle; obtaining at least one piece of supplementary evidence, and calculating the existence probability of the target vehicle based on the at least one piece of supplementary evidence; and generating a virtual target of the target vehicle based on the existence probability and the motion state, wherein the virtual target includes the identifier, position, and speed of the target vehicle.

2. The target vehicle detection method according to claim 1, characterized in that the step of extracting visual features of the target vehicle from the image data to obtain a visual feature map includes: inputting the image data into a convolutional network, and extracting, based on the attention module in the convolutional network, features of the ground contact area at the bottom of the vehicle in front, the two side edge areas at the rear of the vehicle in front, and the projection and shadow areas of the ground behind the vehicle in front; and generating a salient region heatmap from the extracted features, the salient region heatmap serving as the visual feature map; wherein the training supervision signal for the salient region heatmap is the local region annotation related to the target vehicle.

3. The target vehicle detection method according to claim 1, characterized in that extracting the point cloud clustering features of the target vehicle from the radar data includes: clustering the radar data, and identifying as candidate targets the point cloud clusters in which the distance between points is less than a first threshold and the number of points is greater than a second threshold; associating the point cloud clusters belonging to the same candidate target across multiple consecutive frames to obtain an associated trajectory; and filtering the associated trajectory to obtain the point cloud clustering features.
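For illustration only, the clustering step of claim 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: it uses single-linkage grouping on 2D points, the function name `cluster_points` is hypothetical, `max_dist` stands in for the first threshold and `min_points` for the second threshold, and the cross-frame association and trajectory filtering steps are omitted.

```python
import math


def cluster_points(points, max_dist, min_points):
    """Group 2D radar points whose neighbours are closer than max_dist.

    Returns sorted lists of point indices. A cluster is kept as a candidate
    target only if it has MORE than min_points points (assumed reading of
    the "second threshold" in the claim).
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            # Collect every unvisited point within max_dist of point i.
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) < max_dist]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        if len(cluster) > min_points:
            clusters.append(sorted(cluster))
    return sorted(clusters)


# Hypothetical radar detections in metres (x, y).
detections = [(0.0, 0.0), (0.3, 0.0), (0.6, 0.1), (5.0, 5.0), (5.2, 5.1), (20.0, 20.0)]
candidates = cluster_points(detections, max_dist=0.5, min_points=2)
```

In practice a density-based method such as DBSCAN would likely replace this brute-force loop; the sketch only shows how the two thresholds interact.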

4. The target vehicle detection method according to claim 1, characterized in that extracting the micro-Doppler features of the target vehicle from the radar data includes: performing a time-frequency transformation on the radar data to obtain a spectrogram; and extracting the periodically changing curve pattern from the spectrogram as the micro-Doppler feature.
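For illustration only, the two steps of claim 4 can be sketched with a short-time Fourier transform followed by a ridge extraction. This is a sketch under assumptions: the claim does not name the transform, the function names `stft_magnitude`, `dominant_bin_curve`, and `curve_period` are hypothetical, the input is a real-valued sample list rather than complex radar IQ data, and the "periodically changing curve pattern" is approximated by the dominant frequency bin per frame together with its autocorrelation period.

```python
import cmath
import math


def stft_magnitude(signal, frame_len, hop):
    """Return one magnitude spectrum (bins 0 .. frame_len/2 - 1) per frame."""
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Hann window reduces leakage between neighbouring frequency bins.
        frame = [
            signal[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len))
            for n in range(frame_len)
        ]
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2)
        ]
        spectrogram.append(spectrum)
    return spectrogram


def dominant_bin_curve(spectrogram):
    """Trace the strongest frequency bin in each frame (the ridge)."""
    return [max(range(len(s)), key=s.__getitem__) for s in spectrogram]


def curve_period(curve):
    """Estimate the period of the ridge, in frames, from its autocorrelation.

    Needs at least two frames.
    """
    mean = sum(curve) / len(curve)
    centered = [c - mean for c in curve]

    def autocorr(lag):
        return sum(centered[i] * centered[i + lag]
                   for i in range(len(curve) - lag))

    return max(range(1, len(curve) // 2 + 1), key=autocorr)
```

A production pipeline would use an FFT-based routine instead of the direct DFT above, which is quadratic in the frame length and is written out here only to keep the sketch dependency-free.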

5. The target vehicle detection method according to claim 1, characterized in that fusing the visual feature map, the point cloud clustering features, and the micro-Doppler features includes: using the local cue regions in the visual feature map as spatial constraints to limit the extraction range of the point cloud clustering features and the micro-Doppler features; and using the point cloud clustering features and the micro-Doppler features as motion verification to determine whether the local cue regions in the visual feature map belong to moving targets.
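For illustration only, the two-way constraint in claim 5 can be sketched as a gating step followed by a motion check. This is a sketch under assumptions: radar points are taken to be already projected into the same coordinate frame as the cue regions, regions are axis-aligned boxes, and the function name `gate_and_verify` and the parameter `min_speed` are hypothetical.

```python
def gate_and_verify(cue_regions, radar_points, min_speed):
    """Restrict radar points to visual cue regions, then verify motion.

    cue_regions: list of (x_min, y_min, x_max, y_max) boxes.
    radar_points: list of (x, y, radial_speed) in the same coordinates.
    A region counts as moving if any gated point reaches min_speed.
    """
    results = []
    for x_min, y_min, x_max, y_max in cue_regions:
        # Spatial constraint: keep only the radar points inside the region.
        inside = [p for p in radar_points
                  if x_min <= p[0] <= x_max and y_min <= p[1] <= y_max]
        # Motion verification: radar speed confirms or rejects the visual cue.
        is_moving = any(abs(p[2]) >= min_speed for p in inside)
        results.append({
            "region": (x_min, y_min, x_max, y_max),
            "points": inside,
            "is_moving": is_moving,
        })
    return results
```

The gated points of a region would then feed the clustering and micro-Doppler extraction, while `is_moving` filters out cue regions caused by static structure.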

6. The target vehicle detection method according to claim 1, characterized in that the step of inputting the fused feature vectors of multiple consecutive frames into the trained sequence model and outputting the motion state of the target vehicle includes: arranging the fused feature vectors of multiple consecutive frames in temporal order to form a time sequence; and inputting the time sequence into the trained sequence model, the sequence model outputting the position and speed of the target vehicle in the current frame; wherein, when the fused feature vector of a certain frame is missing, the sequence model outputs the predicted position and predicted speed of the target vehicle in the current frame based on the historical time sequence.
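For illustration only, the ordering and gap-handling behaviour of claim 6 can be sketched with a simple stand-in predictor. This is not the trained sequence model of the claim: a constant-velocity extrapolation is swapped in so that the example is runnable, each frame is reduced to a one-dimensional position measurement, and the function name `track_with_gaps` and the parameter `frame_period` are hypothetical.

```python
def track_with_gaps(frames, frame_period):
    """Order frames in time and output (position, velocity) per frame.

    frames: list of (frame_index, measured_position or None); frame indices
    are assumed consecutive. A None measurement marks a missing fused
    feature vector, in which case the state is predicted from history.
    """
    ordered = sorted(frames, key=lambda item: item[0])
    states = []
    position, velocity = None, 0.0
    for _, measured in ordered:
        if measured is None:
            if position is None:
                # No history yet, so nothing can be predicted.
                states.append(None)
                continue
            # Stand-in prediction: extrapolate with the last known velocity.
            position = position + velocity * frame_period
        else:
            if position is not None:
                velocity = (measured - position) / frame_period
            position = measured
        states.append((position, velocity))
    return states


# Hypothetical input: frame 3 is missing, e.g. the target is fully occluded.
history = [(2, 12.0), (0, 10.0), (1, 11.0), (3, None), (4, 14.0)]
states = track_with_gaps(history, frame_period=0.1)
```

A learned recurrent or attention-based model would replace the extrapolation line, but the control flow around a missing frame stays the same.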

7. The target vehicle detection method according to claim 1, characterized in that the step of obtaining at least one piece of supplementary evidence and calculating the existence probability of the target vehicle based on the at least one piece of supplementary evidence includes: obtaining the speed of the vehicle ahead and the speed of the weak radar target corresponding to the point cloud clustering features, and calculating the speed difference between the two as speed difference evidence; acquiring a lidar point cloud, performing ground segmentation on the lidar point cloud to obtain free space regions, and determining whether the point cloud clustering features or the micro-Doppler features exist in the free space regions, as free space evidence; acquiring external communication information and determining whether the external communication information contains target vehicle information that matches the location of the target vehicle, as external communication evidence; and dynamically adjusting the weights of the speed difference evidence, the free space evidence, and the external communication evidence according to the sensor configuration, and summing the three weighted pieces of evidence to obtain the existence probability.
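For illustration only, the weighted summation at the end of claim 7 can be sketched as follows. This is a sketch under assumptions: each piece of evidence is taken to be already mapped to a score in [0, 1], "dynamically adjusted according to the sensor configuration" is interpreted as renormalizing the weights over the sensors that are actually available, and the function name `existence_probability` and the weight values are hypothetical.

```python
def existence_probability(evidence, base_weights):
    """Weighted sum of evidence scores, renormalized over available sensors.

    evidence: dict mapping evidence name to a score in [0, 1], or None when
    the corresponding sensor is absent from the configuration.
    base_weights: dict mapping evidence name to a non-negative weight.
    """
    available = {name: score for name, score in evidence.items()
                 if score is not None}
    total_weight = sum(base_weights[name] for name in available)
    if total_weight == 0:
        # No usable evidence, so no support for the target's existence.
        return 0.0
    weighted = sum(base_weights[name] * score
                   for name, score in available.items())
    return weighted / total_weight


# Hypothetical configuration without an external communication unit.
scores = {"speed_difference": 0.8, "free_space": 1.0, "external_comm": None}
weights = {"speed_difference": 0.5, "free_space": 0.3, "external_comm": 0.2}
probability = existence_probability(scores, weights)
```

Renormalizing keeps the result in [0, 1] regardless of which sensors are fitted, which is one plausible way to make the same formula serve different vehicle configurations.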

8. A target vehicle detection device, characterized in that the target vehicle detection device includes: a data acquisition module, used to acquire image data collected by the vehicle's image sensor and radar data collected by the radar sensor; a feature extraction module, used to extract visual features of the target vehicle from the image data to obtain a visual feature map, and to extract point cloud clustering features and micro-Doppler features of the target vehicle from the radar data, wherein the visual feature map indicates local cue regions in the image that are associated with the target vehicle; a state tracking module, used to fuse the visual feature map, the point cloud clustering features, and the micro-Doppler features to obtain a fused feature vector, and to input the fused feature vectors of multiple consecutive frames into a trained sequence model to output the motion state of the target vehicle; a probability calculation module, used to acquire at least one piece of supplementary evidence and calculate the existence probability of the target vehicle based on the at least one piece of supplementary evidence; and a target generation module, used to generate a virtual target of the target vehicle based on the existence probability and the motion state, wherein the virtual target includes the identifier, position, and speed of the target vehicle.

9. An electronic device, characterized in that the electronic device includes a processor, a memory, and a bus; the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory via the bus; and the machine-readable instructions, when executed by the processor, perform the steps of the target vehicle detection method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the target vehicle detection method as described in any one of claims 1 to 7.