Multi-sensor fusion target detection method, system, medium, device and terminal

By employing a multi-sensor fusion target detection method, utilizing extended Kalman filtering and camera intrinsic and extrinsic parameter calibration, and combining attention mechanism to process radar points, the problem of detection accuracy and robustness of a single sensor in complex scenarios is solved, achieving accurate detection of vehicles and pedestrians under occlusion conditions.

CN116310679B · Active · Publication Date: 2026-06-19 · QINGDAO INST OF COMPUTING TECH XIDIAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: QINGDAO INST OF COMPUTING TECH XIDIAN UNIV
Filing Date: 2023-03-04
Publication Date: 2026-06-19

Application Information

IPC: G06V10/80; G06V10/25; G06V10/82; G06N3/08; G06T7/80

AI Technical Summary

⚠Technical Problem

In existing technologies, single sensors suffer from problems such as limited information, low accuracy, and poor environmental adaptability in target detection. They perform particularly poorly in complex vehicle-road scenarios where vehicles and pedestrians obstruct the view. Furthermore, sensor fusion methods suffer from errors in coordinate system calibration and transformation, which affect detection accuracy and robustness.

⚗Method used

A multi-sensor fusion target detection method is adopted. The radar points are preprocessed by the extended Kalman filter algorithm, and the feature maps are extracted by combining camera intrinsic and extrinsic parameter calibration and Zhang Zhengyou calibration method. The region of interest generated by the radar points is processed by self-attention and cross-attention to realize the spatial and temporal correlation between radar and image. Finally, the detection results are matched by the Hungarian algorithm.

🎯Benefits of technology

It improves the robustness and accuracy of target detection, enabling effective detection of vehicles and pedestrians in complex scenarios, enhancing the safety and efficiency of autonomous driving and security systems, and achieving accurate identification and tracking of vehicles and pedestrians.

✦ Generated by Eureka AI based on patent content.

Figure CN116310679B_ABST

Abstract

This invention belongs to the field of target detection technology and discloses a multi-sensor fusion target detection method, system, medium, device, and terminal. It utilizes dual channels of radar point projection and visual images, employs a Kalman filter algorithm to correlate radar target frames, uses a ResNet-50 backbone as the feature extraction network for the feature map, and uses the Zhang Zhengyou calibration method to calibrate the camera's intrinsic and extrinsic parameters. Spatial correlation is achieved by projecting radar points onto the image through coordinate system transformation, and temporal correlation is achieved using Lagrange interpolation. The final target detection result is obtained by applying self-attention and cross-attention processing to the regions of interest generated by the radar points. This invention leverages the location features of the radar points based on an attention mechanism to help detect the category and location information of the image, resulting in higher robustness and accuracy in detecting the location and category of the image. The effectiveness of this invention is experimentally verified on a real-world vehicle-road dataset.

Description

Technical Field

[0001] This invention belongs to the field of target detection technology, and in particular relates to a multi-sensor fusion target detection method, system, medium, device and terminal.

Background Technology

[0002] In recent years, with the rapid development of computer vision, object detection technology has been applied to fields such as autonomous driving, smart healthcare, intelligent security, and text recognition. In autonomous driving, the vehicle-road scenarios present numerous complex obstacles, and object detection can quickly identify vehicles, pedestrians, and other obstacles, ensuring the real-time performance and accuracy of target detection, thereby improving the safety of autonomous driving. Object detection is also applied to medical image-assisted analysis, demonstrating high accuracy in classifying, recognizing, and predicting diseases. Object detection algorithms can accurately detect text in image scenes and translate text images into text characters. In intelligent security, object detection extracts valid and invalid targets from the foreground of real-time video, improving the robustness of security and reducing the workload of security personnel. However, using a single sensor for object detection has drawbacks such as limited information and low accuracy. These shortcomings significantly impact the accuracy of object detection, limiting its implementation and application. Each sensor has its own advantages and disadvantages. Camera sensors offer advantages such as low cost, rich information, and ease of perception and classification. However, they suffer from poor adaptability to lighting conditions and difficulty in acquiring three-dimensional target information. Radar sensors, on the other hand, are highly adaptable to weather conditions, can operate at night and in various complex environments, and can accurately acquire the vertical position and velocity of targets. Their disadvantages include difficulty in target classification and the inability to distinguish stationary targets. Therefore, fusing data from diverse and heterogeneous sensors to enhance target detection capabilities has become a new and promising approach.

[0003] However, in complex vehicle-road scenarios such as intersections, occlusion issues arise between vehicles and pedestrians. Furthermore, for real-world complex vehicle-road scenarios, global contextual information is crucial. This invention demonstrates that existing target detection strategies based on sensor fusion methods perform poorly in scenarios with partial occlusion of vehicles and pedestrians and in complex environments. These scenarios require global contextual reasoning, such as handling traffic from multiple directions at uncontrolled intersections. Therefore, there is an urgent need to design a novel multi-sensor fusion target detection method and system.

[0004] Based on the above analysis, the problems and shortcomings of the existing technology are as follows:

[0005] (1) Millimeter-wave radar has measurement errors at different distances. The increase in measurement errors will filter out the real radar points, which will seriously affect the accuracy and robustness of fusion detection. Secondly, mainstream fusion methods need to complete the calibration and transformation of the coordinate system. However, in the actual vehicle-road scenario, the calibration and transformation of radar points to the image coordinate system will have the problem of inaccurate projection. Because the radar field of view and angle, as well as the transformation matrix used for projection, have the influence of errors, the radar points cannot be accurately projected to the vicinity of the target, which will eventually lead to the failure of target association and the reduction of target detection robustness.

[0006] (2) Existing camera sensors are poorly adapted to lighting environments and have difficulty acquiring three-dimensional information of targets; radar sensors have difficulty classifying targets and cannot distinguish stationary targets.

[0007] (3) Traditional convolutional neural networks cannot obtain contextual information at the fusion point. Therefore, in complex and dense scenes, the target detection strategy based on the existing sensor fusion method has poor performance in the case of partial occlusion by vehicles and pedestrians and complex scenes.

Summary of the Invention

[0008] To address the problems existing in the prior art, the present invention provides a multi-sensor fusion target detection method, system, medium, device and terminal, and particularly relates to a multi-sensor fusion target detection method, system, medium, device and terminal based on an attention mechanism.

[0009] This invention is implemented as follows: a multi-sensor fusion target detection method, comprising: based on the spatiotemporal correlation of radar and camera sensor data, preprocessing radar points using an extended Kalman filter algorithm. The algorithm can adaptively learn filter hyperparameters, improving sensor detection accuracy, reducing the impact of sensor detection errors on fusion detection, and completing the spatiotemporal correlation of multi-sensor data; using a ResNet-50 backbone as the feature extraction network for the feature map; and using the Zhang Zhengyou calibration method to calibrate the camera's intrinsic and extrinsic parameters; achieving spatial correlation by projecting radar points onto the image through coordinate system transformation; and achieving temporal correlation using Lagrange interpolation; and obtaining the final target detection result by performing self-attention and cross-attention processing on the regions of interest generated from the radar points.

[0010] Furthermore, the multi-sensor fusion target detection method includes the following steps:

[0011] Step 1: Construct a radar target tracking algorithm based on the extended Kalman filter algorithm, and establish relevant motion state prediction and update equations based on the radar's state transition matrix and noise parameters;

[0012] Step 2, Camera parameter calibration: Match the points in the world coordinate system with the points in the pixel coordinate system to obtain the camera's intrinsic and extrinsic parameters, which are used for conversion between the camera coordinate system and the world coordinate system;

[0013] Step 3, Spatial association between radar and image: Using intrinsic and extrinsic parameter matrices, the world coordinate system is transformed to the image coordinate system, and the radar points are transformed to the image coordinate system, thus realizing the spatial association between radar points and the image;

[0014] Step 4, radar and image temporal correlation: The millimeter-wave radar reports detection results at 20 Hz in the two-dimensional coordinate system, and Lagrange interpolation is performed on each vehicle trajectory to resample the radar data;

[0015] Step 5, Region of Interest Generation: After projecting the radar points onto the image, the pixels near the radar points are taken as the region of interest, and the region of interest is taken as the focus area for self-attention and cross-attention.

[0016] Step 6, Attention-based object detection: Feature maps are generated and features are extracted using a ResNet-50 backbone convolutional neural network, and the final object detection result is obtained through self-attention processing.

[0017] Furthermore, the radar target tracking algorithm constructed in step one based on the extended Kalman filter algorithm includes:

[0018] (1) Derive the radar measurement function

[0019] ρ is the distance from the radar to the obstacle. φ is the bearing angle of the obstacle, positive when rotating counterclockwise from the x-axis; the actual measured angle... is negative. The radial velocity ρ̇ is the projection of the velocity v onto the radar line of sight. The radar data are then converted from polar coordinates to Cartesian coordinates. The conversion formulas are as follows:

[0020] Distance ρ is the distance from the radar to the obstacle, defined as:

[0021]

[0022] φ is the angle between the direction of ρ and the x-axis, defined as:

[0023]

[0024] The radial velocity ρ̇ is then defined as:

[0025]

[0026] The measurement function obtained from the millimeter-wave radar is:

[0027]
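
For reference, under the common assumption of a Cartesian state (p_x, p_y, v_x, v_y), the three quantities defined above and the resulting measurement function are usually written as follows. The symbols p_x, p_y, v_x, v_y are assumptions introduced here for illustration and are not taken from the patent text.

```latex
\[
\rho = \sqrt{p_x^{2} + p_y^{2}}, \qquad
\varphi = \arctan\!\left(\frac{p_y}{p_x}\right), \qquad
\dot{\rho} = \frac{p_x v_x + p_y v_y}{\sqrt{p_x^{2} + p_y^{2}}}
\]
\[
h(x') = \begin{bmatrix} \rho \\ \varphi \\ \dot{\rho} \end{bmatrix}
\]
```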

[0028] (2) Construct the Jacobian matrix of the extended Kalman filter.

[0029]
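
As a hedged illustration, if the state is taken to be (p_x, p_y, v_x, v_y) and the measurement is (ρ, φ, ρ̇), both of which are assumptions not confirmed by the patent text, the Jacobian of the measurement function with respect to the state takes the following standard form:

```latex
\[
H = \frac{\partial h}{\partial x} =
\begin{bmatrix}
\dfrac{p_x}{\rho} & \dfrac{p_y}{\rho} & 0 & 0 \\[8pt]
-\dfrac{p_y}{\rho^{2}} & \dfrac{p_x}{\rho^{2}} & 0 & 0 \\[8pt]
\dfrac{p_y\,(v_x p_y - v_y p_x)}{\rho^{3}} &
\dfrac{p_x\,(v_y p_x - v_x p_y)}{\rho^{3}} &
\dfrac{p_x}{\rho} & \dfrac{p_y}{\rho}
\end{bmatrix},
\qquad \rho = \sqrt{p_x^{2} + p_y^{2}}
\]
```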

[0030] (3) Predicting radar motion state

[0031] When the current position and velocity of the radar target are ρ, and the object maintains the same velocity while moving, the prediction equation is as follows:

[0032] ρ′=f(ρ,u)

[0033] P′ = FPFᵀ + Q

[0034] (4) Update radar motion status

[0035] By using the measurement function to map the state vector to the sensor's measurement space, and comparing the radar's measured position with the predicted object position, the radar update equations are as follows:

[0036] y = z − h(x′)

[0037] S = HP′Hᵀ + R

[0038] K = P′HᵀS⁻¹

[0039] x = x′ + Ky

[0040] P = (I − KH)P′

[0041] Where x is the distance from the sensor to the front of the target vehicle, y is the lateral distance from the vehicle to the target vehicle, P is the covariance matrix of the predicted value, and Kalman gain K is calculated together with H and sensor error R.
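
The prediction and update steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the state is assumed to be [px, py, vx, vy] with a constant-velocity transition, the measurement is assumed to be [ρ, φ, ρ̇], and the standard sign convention y = z − h(x′), x = x′ + Ky is used. All function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def h(x):
    """Assumed measurement function: Cartesian state -> [rho, phi, rho_dot]."""
    px, py, vx, vy = x
    rho = math.hypot(px, py)
    if rho < 1e-9:
        raise ValueError("target too close to the radar origin")
    return [rho, math.atan2(py, px), (px * vx + py * vy) / rho]

def jacobian(x):
    """Jacobian of h with respect to the state (3 x 4)."""
    px, py, vx, vy = x
    r2 = px * px + py * py
    if r2 < 1e-18:
        raise ValueError("target too close to the radar origin")
    r = math.sqrt(r2)
    r3 = r2 * r
    return [
        [px / r, py / r, 0.0, 0.0],
        [-py / r2, px / r2, 0.0, 0.0],
        [py * (vx * py - vy * px) / r3, px * (vy * px - vx * py) / r3, px / r, py / r],
    ]

def predict(x, P, dt, Q):
    """Constant-velocity prediction: x' = f(x), P' = F P F^T + Q."""
    F = [[1.0, 0.0, dt, 0.0], [0.0, 1.0, 0.0, dt],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    px, py, vx, vy = x
    x_pred = [px + vx * dt, py + vy * dt, vx, vy]
    P_pred = add(matmul(matmul(F, P), transpose(F)), Q)
    return x_pred, P_pred

def update(x_pred, P_pred, z, R):
    """EKF update with a radar measurement z = [rho, phi, rho_dot]."""
    H = jacobian(x_pred)
    y = [zi - hi for zi, hi in zip(z, h(x_pred))]
    y[1] = math.atan2(math.sin(y[1]), math.cos(y[1]))  # wrap angle residual to (-pi, pi]
    Ht = transpose(H)
    S = add(matmul(matmul(H, P_pred), Ht), R)
    K = matmul(matmul(P_pred, Ht), inverse(S))
    x_new = [xi + sum(K[i][j] * y[j] for j in range(len(y)))
             for i, xi in enumerate(x_pred)]
    KH = matmul(K, H)
    n = len(x_pred)
    I_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(n)] for i in range(n)]
    return x_new, matmul(I_KH, P_pred)
```

A caller would run `predict` once per radar cycle (for example with dt = 0.05 s at 20 Hz) and then `update` with the associated measurement; the noise matrices Q and R are tuning parameters that the text does not specify.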

[0042] Furthermore, the camera parameter calibration in step two includes: capturing multiple images of the chessboard calibration board from different angles and distances and applying the Zhang Zhengyou chessboard calibration method to obtain the camera's intrinsic and extrinsic parameters; and transforming the camera coordinate system to the image coordinate system and the pixel coordinate system to obtain the camera's intrinsic and extrinsic parameters and the scaling factor, as shown in the following formula:

[0043]
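
For orientation, the pinhole projection model that Zhang's calibration method usually estimates can be written as below. This is the textbook form, given as an assumption about what the formula expresses; the symbols s (scaling factor), K (intrinsic matrix), R and t (extrinsic rotation and translation), and the zero-skew simplification are not taken from the patent text.

```latex
\[
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K \begin{bmatrix} R & t \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix},
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\]
```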

[0044] Furthermore, in step three, the camera is initially calibrated by using the intrinsic and extrinsic parameter matrices obtained in step two, and the conversion between the image coordinate system and the world coordinate system is realized. The world coordinate system is established with the camera as the origin. After converting the radar relative coordinate system to the world coordinate system, the radar points are converted to the image coordinate system.

[0045] Choosing the 0° direction of the camera as the y-axis of the world coordinate system, and the 90° clockwise direction of the y-axis as the x-axis of the world coordinate system, the Zhang Zhengyou calibration method is used to obtain the ratio of image coordinates to world coordinates and the camera's intrinsic and extrinsic parameter matrices. The correspondence between the image coordinate system and the world coordinate system is then obtained, described by the following equation:

[0046]

[0047] Where H represents the product of intrinsic and extrinsic parameter matrices, Z represents the scaling factor between pixel coordinates and world coordinates, u and v represent pixel coordinates, and U and V represent the corresponding world coordinates. The world coordinates corresponding to the pixel coordinates are obtained through coordinate transformation, thus completing the transformation between the pixel coordinate system and the world coordinate system.
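
The mapping from ground-plane world coordinates to pixel coordinates can be sketched as follows, assuming the relation takes the homogeneous form Z·[u, v, 1]ᵀ = H·[U, V, 1]ᵀ with a 3×3 matrix H. That form, the function name, and the numeric matrix in the usage note are illustrative assumptions, not calibration results from the patent.

```python
def world_to_pixel(H, U, V):
    """Project world coordinates (U, V) to pixel coordinates (u, v).

    Computes H * [U, V, 1]^T and divides by the third component,
    which plays the role of the scaling factor Z.
    """
    x = H[0][0] * U + H[0][1] * V + H[0][2]
    y = H[1][0] * U + H[1][1] * V + H[1][2]
    Z = H[2][0] * U + H[2][1] * V + H[2][2]
    if abs(Z) < 1e-12:
        raise ValueError("point projects to infinity (scale factor is zero)")
    return x / Z, y / Z, Z
```

For example, with the made-up matrix `H = [[100.0, 0.0, 640.0], [0.0, -20.0, 700.0], [0.0, 0.1, 1.0]]`, calling `world_to_pixel(H, 2.0, 10.0)` yields a pixel near (420, 250) with scale factor 2. Going from pixels back to world coordinates would use the inverse of H in the same way.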

[0048] Furthermore, in step six, after radar target tracking, radar point and image temporal and spatial alignment, and generation of region of interest, feature maps of different dimensions and scales are generated through a ResNet-50 backbone convolutional neural network. The feature maps are translated into a feature map sequence, which is input into the encoder to extract features from the feature map sequence, and self-attention and cross-attention are learned in the decoder. Self-attention processing is performed on the region of interest in the feature map sequence to obtain the final target detection result.
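
The attention operation at the core of the encoder and decoder can be illustrated with a minimal scaled dot-product sketch. It is a simplification under stated assumptions: it omits the learned query, key and value projections, multiple heads, and positional encodings that a real Transformer would include, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention.

    Each query produces a weighted average of `values`, with weights
    given by softmax(q . k / sqrt(d)) over all keys.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

In this sketch, self-attention over a feature map sequence `feats` is `attention(feats, feats, feats)`, while cross-attention from ROI-derived queries to the encoded sequence is `attention(roi_queries, feats, feats)`.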

[0049] The decoder outputs a fixed-size set of N predictions, where N is set to be significantly larger than the typical number of objects in an image. When the radar projection onto the image generates m regions of interest (ROIs), N−m random predictions are generated based on these m ROIs. Each of the N prediction results contains a tuple (c, box), where c represents the object category and box represents the location of the detected bounding box in the image. The Hungarian algorithm is used for bipartite graph matching: it maps elements between the prediction set and the ground-truth set, computes the loss between each prediction and the ground truth, and minimizes the total matching loss. Because each detection result is a category-location tuple, the loss for each prediction is a linear combination of a category loss and a position loss. Finally, the Hungarian algorithm yields the optimal matching over all targets. The expressions for the detection results are as follows:

[0050]

[0051]

[0052] Since each detection result is a (class, location) pair, as shown in Equation 4-5, the loss for each prediction consists of a class loss and a box location loss; the box location loss is made up of two parts, the L1 loss and the IoU loss L_iou, combined linearly. Here b_i denotes the ground-truth value and b̂ denotes the predicted value. Finally, as described in Equation 4-6, b̂ indicates the predicted location of the target, and the indicator term tests whether the target category is the empty set: its value is 0 when that condition is met and 1 otherwise. Summing over all targets matched by the Hungarian algorithm gives the optimal loss L_H.
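
The matching step can be illustrated with a small sketch. Note the substitution: instead of the Hungarian algorithm, it enumerates assignments by brute force, which finds the same minimum-cost matching but is only practical for a handful of objects. The cost form (negative class probability plus weighted L1 and 1 − IoU terms), the weights, and all names are assumptions for illustration.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pair_cost(pred, gt, w_l1=1.0, w_iou=1.0):
    """Cost of matching one prediction (class_probs, box) to one ground truth (cls, box)."""
    probs, box = pred
    cls, gt_box = gt
    l1 = sum(abs(p - g) for p, g in zip(box, gt_box))
    return -probs[cls] + w_l1 * l1 + w_iou * (1.0 - iou(box, gt_box))

def match(preds, gts):
    """Return (assignment, total_cost) with minimum total cost.

    assignment[i] is the index of the prediction matched to gts[i].
    Requires len(preds) >= len(gts); unmatched predictions are treated
    as the empty ("no object") class by the caller.
    """
    best = None
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[i]) for i, p in enumerate(perm))
        if best is None or cost < best[1]:
            best = (perm, cost)
    return best
```

For realistic values of N, a polynomial-time solver for the linear assignment problem would replace the enumeration while keeping the same cost matrix.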

[0053] Another object of the present invention is to provide a multi-sensor fusion target detection system applying the aforementioned multi-sensor fusion target detection method, the multi-sensor fusion target detection system comprising:

[0054] The radar target frame association module is used to associate radar target frames using a Kalman filter algorithm through dual channels of radar point projection and visual image.

[0055] The camera intrinsic and extrinsic parameter calibration module is used to take the ResNet-50 backbone as the feature extraction network for the feature map, and to calibrate the camera's intrinsic and extrinsic parameters using the Zhang Zhengyou calibration method.

[0056] The spatial / temporal correlation module is used to achieve spatial correlation by projecting radar points onto the image through coordinate system transformation, and to achieve temporal correlation using Lagrange interpolation.

[0057] The fusion target detection module is used to obtain the final target detection result by performing self-attention and cross-attention processing on the region of interest generated by radar points.

[0058] Another object of the present invention is to provide a computer device, the computer device including a memory and a processor, the memory storing a computer program, which, when executed by the processor, causes the processor to perform the steps of the multi-sensor fusion target detection method.

[0059] Another object of the present invention is to provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the multi-sensor fusion target detection method.

[0060] Another objective of this invention is to provide an information data processing terminal for implementing the aforementioned multi-sensor fusion target detection system.

[0061] Based on the above technical solutions and the technical problems solved, the advantages and positive effects of the technical solution to be protected by this invention are as follows:

[0062] First, addressing the technical problems existing in the prior art and the difficulty of solving them, this paper closely analyzes, in conjunction with the technical solution to be protected by this invention and the results and data obtained during the research and development process, how the technical solution of this invention solves the technical problems, and the inventive technical effects brought about by solving these problems. The specific description is as follows:

[0063] To address the poor robustness of existing single-sensor-based target detection methods in target detection and tracking scenarios, this invention provides a solution using millimeter-wave radar and camera fusion for target detection. Furthermore, to address the issue of inadequate performance of existing sensor fusion target detection strategies in complex and dense scenes with partial occlusion by vehicles and pedestrians, and in complex environments, this invention proposes a novel multi-sensor fusion target detection method based on an attention mechanism. This method integrates image and millimeter-wave radar data using an attention mechanism. Finally, experiments on a real-world vehicle-road dataset validate the effectiveness of this fusion target detection method.

[0064] The multi-sensor fusion target detection method of this invention sets a fixed length and width for radar points projected onto the image to generate a region of interest (ROI). This ROI can roughly cover the location of the target, and it serves as the input to the decoder. An attention mechanism is applied to the ROI for target detection. Using the attention mechanism is beneficial for extracting and detecting image context information, further improving the robustness of detection and fusion. Even when the points projected onto the image by the radar are offset, this method can still obtain the target detection result. Finally, a bipartite graph matching loss based on the Hungarian algorithm is used to calculate the matching loss between the predicted bounding box and the predicted category output by the decoder. The object category and the predicted bounding box in the image are output as a pair, and Hungarian matching is performed with the object category and location in the ground truth labeled image to accelerate the convergence of the model.
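
Generating a fixed-size region of interest around a projected radar point can be sketched as follows. The ROI dimensions, the centering of the box on the radar point, and the clipping to the image bounds are assumptions; the text only states that a fixed length and width are used.

```python
def radar_point_to_roi(u, v, roi_w, roi_h, img_w, img_h):
    """Return (x1, y1, x2, y2) of a roi_w x roi_h box centered on pixel (u, v).

    The box is clipped to the image, so ROIs near the border are smaller.
    """
    x1 = max(0, int(round(u - roi_w / 2)))
    y1 = max(0, int(round(v - roi_h / 2)))
    x2 = min(img_w, int(round(u + roi_w / 2)))
    y2 = min(img_h, int(round(v + roi_h / 2)))
    return x1, y1, x2, y2
```

Because the ROI only needs to cover the target roughly, a generous box size leaves room for the projection offsets mentioned above; the appropriate size depends on the camera geometry and is not given in the text.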

[0065] This invention presents a radar target tracking algorithm based on the extended Kalman filter (EKF) algorithm. Given that the radial velocity, distance to obstacles, and angle of the radar target are all nonlinear models, and both process and observation noise exhibit Gaussian distributions, standard Kalman filtering cannot be used for target tracking. This invention establishes relevant motion state prediction and update equations based on the radar's state transition matrix and noise parameters, effectively achieving the correlation of radar targets between frames and efficiently obtaining the radar target IDs. Furthermore, this invention projects radar points onto the image and uses pixels near the radar points as regions of interest (ROIs). These ROIs are then used as focus areas for self-attention and cross-attention, facilitating the implementation of attention-based target detection methods.

[0066] This invention addresses the occlusion problem of vehicles and pedestrians in complex vehicular and road scenarios, such as intersections, and extracts key global contextual information in real-world, complex vehicular and road environments. The invention demonstrates that existing target detection strategies based on sensor fusion methods perform poorly in scenarios with partial occlusion of vehicles and pedestrians and in complex environments. These scenarios require global contextual reasoning, such as handling traffic from multiple directions at uncontrolled intersections. This invention utilizes an attention mechanism based on radar projection points, leveraging the features of the radar point locations to aid in detecting image category and location information, resulting in higher robustness and accuracy in detecting image location and category.

[0067] This invention integrates radar and video surveillance information on the speed, distance, azimuth, and direction of movement of targets within the monitored area. It intelligently drives a high-definition, high-speed network intelligent PTZ camera to perform real-time dynamic tracking and intelligent zoom-based clear capture and verification of intruding targets. Simultaneously, it proactively issues real-time audible and visual alarms, clearly displays the target's movement trajectory and intrusion scene, and enables the radar to automatically detect and identify suspicious intrusion targets, automatically activate cameras to track and monitor suspicious targets, and automatically trigger audible, visual, and SMS alarms. This significantly improves the work efficiency of surveillance personnel and enables target detection and alarm processing for pedestrians within the monitored area.

[0068] Second, considering the technical solution as a whole or from a product perspective, the technical effects and advantages of the technical solution to be protected by this invention are specifically described as follows:

[0069] This invention provides a multi-sensor fusion target detection method based on an attention mechanism, which is a robust target detection algorithm that can perform well in situations where vehicles and pedestrians partially obscure the target and in complex scenes.

[0070] Third, as supplementary evidence of the inventive step of the claims of this invention, it is also reflected in the following important aspects:

[0071] The technical solution of this invention fills a technological gap in the industry both domestically and internationally:

[0072] Single-sensor visual cameras have limited detection and recognition accuracy and poor stability, and their detection range is also inaccurate. Furthermore, cameras are susceptible to factors such as lighting and weather, especially at night, in fog, and in rain. In contrast, radar is less affected by weather conditions, has higher stability, and provides more accurate distance measurements, enabling it to measure greater distances. However, current millimeter-wave radar has low resolution and is sensitive to metal, resulting in relatively poor object recognition performance and an inability to acquire target feature information. Therefore, in real-world environments, a single sensor cannot solve all target detection and tracking problems. The fusion of millimeter-wave radar and cameras is a growing trend in target detection.

[0073] Convolutional neural networks (CNNs) are limited by their receptive field and feature map scale, resulting in poor performance when vehicles are occluded, failing to extract global contextual information about occluded pedestrians and vehicles. Therefore, this invention proposes using the Transformer model to process feature maps. Combining the Transformer's powerful global feature extraction capabilities can effectively address the problem of occluded pedestrians and vehicles in complex road and vehicle scenarios.

Attached Figure Description

[0074] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0075] Figure 1 is a flowchart of the multi-sensor fusion target detection method provided in the embodiments of the present invention;

[0076] Figure 2 is a schematic diagram of the multi-sensor fusion target detection method provided in the embodiments of the present invention;

[0077] Figure 3 is a schematic diagram of radar data and visual image interpolation provided in an embodiment of the present invention;

[0078] Figure 4 is a flowchart of radar and image spatial association provided in an embodiment of the present invention;

[0079] Figure 5 is a schematic diagram of camera parameter calibration using the checkerboard calibration method provided in an embodiment of the present invention;

[0080] Figure 6 is a schematic diagram of the intelligent sentinel security system provided in an embodiment of the present invention;

[0081] In the diagram: 1. IoT device; 11. Camera; 12. Relay; 13. Alarm; 14. Radar; 2. Internet; 3. Server; 31. Streaming box; 32. Server; 4. Core router; 5. Core switch; 6. Client; 61. PC; 62. Mobile.

Detailed Implementation

[0082] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0083] To address the problems existing in the prior art, the present invention provides a multi-sensor fusion target detection method, system, medium, device, and terminal. The present invention will be described in detail below with reference to the accompanying drawings.

[0084] As shown in Figure 1, the multi-sensor fusion target detection method provided in this embodiment of the invention includes the following steps:

[0085] S101 uses a dual-channel approach of radar point projection and visual image to perform correlation between radar target frames using a Kalman filter algorithm.

[0086] S102 uses the ResNet-50 backbone as the feature extraction network for the feature map and uses Zhang Zhengyou calibration method to calibrate the camera's intrinsic and extrinsic parameters.

[0087] S103 achieves spatial correlation by projecting radar points onto the image through coordinate system transformation, and achieves temporal correlation by using Lagrange interpolation.

[0088] S104, by performing self-attention and cross-attention processing on the region of interest generated by the radar points, the final target detection result is obtained.

[0089] As a preferred embodiment, as shown in Figure 2, the multi-sensor fusion target detection method provided in this embodiment of the invention specifically includes the following steps:

[0090] S1: Radar target tracking algorithm based on extended Kalman filter: Given that the radial velocity, distance to obstacles, and angle of the radar target are all nonlinear models, and both process noise and observation noise exhibit Gaussian distributions, Kalman filtering cannot be used for target tracking. Therefore, relevant motion state prediction and update equations are established based on the radar's state transition matrix and noise parameters. This step effectively correlates radar targets between frames and effectively obtains the radar's ID.

[0091] In the radar target tracking algorithm based on the extended Kalman filter provided in this embodiment of the invention, the transformation function is not a linear function, and Gaussian distribution cannot be applied to nonlinear measurements, thus Kalman filtering cannot be used. To solve this problem, extended Kalman filtering is required.

[0092] The radar target tracking algorithm based on the extended Kalman filter algorithm includes the following four steps:

[0093] (1) Derive the radar measurement function;

[0094] (2) Jacobian matrix of extended Kalman filter;

[0095] (3) Predict the motion state of the radar;

[0096] (4) Update radar motion status.

[0097] S2: Camera Parameter Calibration: By taking multiple images of the calibration board and then mapping multiple real-world points (points in the world coordinate system) to points in the images (points in the pixel coordinate system), the correspondence between world coordinates and pixel coordinates can be determined. This step obtains the camera's intrinsic and extrinsic parameters, which are used for conversion between the camera coordinate system and the world coordinate system.

[0098] In the camera parameter calibration provided in this embodiment of the invention, the Zhang Zhengyou chessboard calibration method is used: the camera photographs the chessboard calibration board from different angles and distances. From these images, the camera's intrinsic and extrinsic parameters and the scaling factor are obtained, which allow the camera coordinate system to be transformed to the image coordinate system and the pixel coordinate system.

[0099] By using step S2 to obtain the intrinsic and extrinsic parameter matrices, the initial camera calibration is completed. The purpose is to realize the transformation between the image coordinate system and the world coordinate system. The world coordinate system is established with the camera as the origin.

[0100] S3: Spatial association between radar and image. By using the intrinsic and extrinsic parameter matrices obtained in step S2, the transformation from the world coordinate system to the image coordinate system can be achieved. After transforming the radar relative coordinate system to the world coordinate system, the radar points can be further transformed to the image coordinate system, realizing the spatial association between radar points and the image.

[0101] S4: Radar and Image Temporal Correlation: The millimeter-wave radar outputs detection results at 20 Hz in its two-dimensional coordinate system, including the position, velocity, and target ID obtained in step S1 for each object. To maintain consistency between the radar data sampling frequency and the video frames, this method performs Lagrange interpolation on each vehicle trajectory to resample the radar data. It is assumed that the vehicle speed changes very little within a short period of time, so the same speed is used within this time period. Interpolation thus aligns the radar data with the video frames in time.
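The Lagrange interpolation named in S4 can be illustrated with a minimal sketch. The function name `lagrange_interpolate` and the sample values are hypothetical and chosen only for illustration. With two samples, the Lagrange polynomial reduces to ordinary linear interpolation between consecutive radar frames.

```python
def lagrange_interpolate(ts, xs, t):
    """Evaluate the Lagrange polynomial through the points (ts[i], xs[i]) at time t."""
    total = 0.0
    for i, (ti, xi) in enumerate(zip(ts, xs)):
        # Lagrange basis weight for sample i at query time t.
        weight = 1.0
        for j, tj in enumerate(ts):
            if j != i:
                weight *= (t - tj) / (ti - tj)
        total += xi * weight
    return total


# Two radar samples 50 ms apart (20 Hz); query a camera frame time in between.
# The coordinates 10.0 and 11.0 are made-up illustration values.
x_at_40ms = lagrange_interpolate([0.0, 50.0], [10.0, 11.0], 40.0)
```

Using more than two samples gives a higher-order polynomial, which can overshoot on noisy tracks, so a low order is usually the safer choice for short intervals.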

[0102] S5: Generating Region of Interest (ROI): After projecting radar points onto the image, this invention uses the pixels near the radar points as the ROI. The generated ROI will serve as the focus area for self-attention and cross-attention. The ROI generated in this step facilitates the implementation of the attention-based target detection method in step S6.

[0103] S6: Attention-based target detection method: After the first five steps of radar target tracking, radar point and image temporal and spatial alignment and generation of region of interest, feature maps of different dimensions and scales are generated by ResNet-50 backbone convolutional neural network. The feature maps are translated into a feature map sequence and input into the encoder to extract features from the feature map sequence. Self-attention and cross-attention are learned in the decoder. The region of interest plays a role in detection guidance in the decoder. Self-attention processing is performed on the region of interest part in the feature map sequence to obtain the final target detection result.

[0104] The multi-sensor fusion target detection system provided in this embodiment of the invention includes:

[0105] The radar target frame association module is used to associate radar target frames using a Kalman filter algorithm through dual channels of radar point projection and visual image.

[0106] The camera intrinsic and extrinsic parameter calibration module is used to adopt the ResNet-50 backbone as the feature extraction network for the feature map and to calibrate the camera intrinsic and extrinsic parameters using the Zhang Zhengyou calibration method.

[0107] The spatial / temporal correlation module is used to achieve spatial correlation by projecting radar points onto the image through coordinate system transformation, and to achieve temporal correlation using Lagrange interpolation.

[0108] The fusion target detection module is used to obtain the final target detection result by performing self-attention and cross-attention processing on the region of interest generated by radar points.

[0109] To demonstrate the inventiveness and technical value of the technical solution of this invention, this section provides specific product or related technology application examples of the technical solution claimed.

[0110] This invention is applied to a smart road system. The hardware system mainly consists of IoT devices for data collection. The software system comprises a server and a client. The server is primarily responsible for processing sensor data and pre-delineating detection zones based on the needs of security personnel. Simultaneously, camera and radar sensors are integrated to detect targets within the detection zones. Upon the appearance of unfamiliar personnel or vehicles within the zones, the system immediately issues an audible and visual alarm and takes photos or videos of the detected targets. Alarm information, photos, and videos are stored in the server's database and displayed on the client. The client can interact with the server to obtain real-time information from the smart road system, including alarm information, photos, and videos.

[0111] As a preferred embodiment, as shown in Figure 2, the multi-sensor fusion target detection method provided in this embodiment of the invention specifically includes the following steps:

[0112] Step 1: In radar target tracking, the radar target tracking algorithm based on the extended Kalman filter algorithm is used. Since the radial velocity of the radar target, the distance and angle to the obstacle are all nonlinear models, and the process noise and observation noise both satisfy the Gaussian distribution, the relevant motion state prediction and update equations can be established based on the radar state transition matrix and noise parameters, which can effectively realize the association of radar targets between frames and obtain the radar ID.

[0113] The radar target tracking algorithm based on the extended Kalman filter algorithm provided in this embodiment of the invention includes:

[0114] (1) Derive the radar measurement function;

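[0115] Millimeter-wave radar returns the following types of data: ρ is the distance from the radar to the obstacle, and φ is the obstacle's azimuth angle. Note that φ is positive when rotating counterclockwise from the x-axis, so in this case the actual measured angle φ is negative.

[0116] The radial velocity $\dot{\rho}$ is the projection of the velocity v onto the line from the radar to the obstacle. Therefore, processing radar data first requires relating polar coordinates to Cartesian coordinates. With the target state written in Cartesian coordinates as position $(p_x, p_y)$ and velocity $(v_x, v_y)$, the specific conversion formulas are as follows:

[0117] The distance ρ is the distance from the radar to the obstacle, and can be defined as:

[0118] $$\rho = \sqrt{p_x^2 + p_y^2}$$

[0119] The angle φ between ρ and the x-direction can be defined as:

[0120] $$\varphi = \arctan\left(\frac{p_y}{p_x}\right)$$

[0121] The radial velocity $\dot{\rho}$ is then defined as:

[0122] $$\dot{\rho} = \frac{p_x v_x + p_y v_y}{\sqrt{p_x^2 + p_y^2}}$$

[0123] From this, the measurement function of millimeter-wave radar can be obtained:

[0124] $$h(x) = \begin{pmatrix} \rho \\ \varphi \\ \dot{\rho} \end{pmatrix} = \begin{pmatrix} \sqrt{p_x^2 + p_y^2} \\ \arctan\left(p_y / p_x\right) \\ \dfrac{p_x v_x + p_y v_y}{\sqrt{p_x^2 + p_y^2}} \end{pmatrix}$$

[0125] At this point, it can be observed that this transformation function is not a linear function. A Gaussian distribution does not remain Gaussian after a nonlinear measurement function is applied, and therefore the standard Kalman filter cannot be used. To solve this problem, extended Kalman filtering is required.

[0126] (2) Jacobian matrix of extended Kalman filter;

[0127] $$H_j = \frac{\partial h}{\partial x} = \begin{pmatrix} \dfrac{p_x}{\rho} & \dfrac{p_y}{\rho} & 0 & 0 \\ -\dfrac{p_y}{\rho^2} & \dfrac{p_x}{\rho^2} & 0 & 0 \\ \dfrac{p_y (v_x p_y - v_y p_x)}{\rho^3} & \dfrac{p_x (v_y p_x - v_x p_y)}{\rho^3} & \dfrac{p_x}{\rho} & \dfrac{p_y}{\rho} \end{pmatrix}, \quad x = (p_x, p_y, v_x, v_y)^T$$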
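The measurement function in (1) and its Jacobian in (2) can be sketched in code. This is a minimal illustration, assuming a Cartesian state ordered as (px, py, vx, vy); the function names and the small guards against division by zero are assumptions, not part of the original text.

```python
import numpy as np


def radar_measurement(state):
    """Map a Cartesian state (px, py, vx, vy) to radar space (rho, phi, rho_dot)."""
    px, py, vx, vy = state
    rho = np.hypot(px, py)
    phi = np.arctan2(py, px)  # counterclockwise from the x-axis is positive
    rho_dot = (px * vx + py * vy) / max(rho, 1e-6)  # guard is an assumption
    return np.array([rho, phi, rho_dot])


def radar_jacobian(state):
    """Partial derivatives of (rho, phi, rho_dot) with respect to (px, py, vx, vy)."""
    px, py, vx, vy = state
    rho2 = max(px * px + py * py, 1e-12)  # guard is an assumption
    rho = np.sqrt(rho2)
    rho3 = rho2 * rho
    return np.array([
        [px / rho, py / rho, 0.0, 0.0],
        [-py / rho2, px / rho2, 0.0, 0.0],
        [py * (vx * py - vy * px) / rho3,
         px * (vy * px - vx * py) / rho3,
         px / rho,
         py / rho],
    ])
```

Using `arctan2` instead of a plain arctangent keeps the angle in the correct quadrant when px is negative or zero.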

[0128] (3) Predict the motion state of the radar;

[0129] Suppose the current position and velocity of a radar target are known and denoted as the state x. The object's state a short time later (for example, one second later) can then be predicted from this position and velocity, under the assumption that the object maintains the same velocity. The prediction is calculated using the function f. However, the object may not maintain exactly the same velocity; it may change direction, accelerate, or decelerate. Therefore, the uncertainty increases in the prediction step. The prediction equations are as follows, where F is the state transition matrix, P is the state covariance, and Q is the process noise covariance:

[0130] $$x' = f(x, u)$$

[0131] $$P' = F P F^{T} + Q$$
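The prediction step can be sketched as follows, assuming a constant-velocity model with state (px, py, vx, vy) and no control input. The isotropic process noise and the tuning value `q` are hypothetical simplifications chosen for brevity.

```python
import numpy as np


def ekf_predict(x, P, dt, q=1.0):
    """Constant-velocity prediction for state x = (px, py, vx, vy)."""
    # State transition matrix: position advances by velocity * dt.
    F = np.array([
        [1.0, 0.0, dt, 0.0],
        [0.0, 1.0, 0.0, dt],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
    # Assumed process noise (hypothetical): uncertainty grows with elapsed time.
    Q = q * dt * np.eye(4)
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred
```

Because Q is added at every step, the covariance grows during prediction, which matches the statement that uncertainty increases when the object may accelerate or turn.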

[0132] (4) Update radar motion status;

[0133] In the update step, a measurement function maps the state vector to the sensor's measurement space. As a concrete example, a lidar measures only the object's position, whereas the extended Kalman filter models both the object's position and velocity, so multiplying by the measurement matrix H discards the velocity information in the state vector x. The lidar-measured position can then be compared with the predicted object position. The lidar update equations are shown below:

[0134] $$y = z - h(x')$$

[0135] $$S = H P' H^{T} + R$$

[0136] $$K = P' H^{T} S^{-1}$$

[0137] $$x = x' + K y$$

[0138] $$P = (I - K H) P'$$
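The update step can be sketched as below. The function signature is an assumption: `h` is the measurement function, `H` its matrix or Jacobian evaluated at the predicted state, and `angle_index` is a hypothetical optional argument for wrapping an angular residual when the measurement contains an azimuth angle.

```python
import numpy as np


def ekf_update(x_pred, P_pred, z, h, H, R, angle_index=None):
    """One Kalman update: residual, innovation covariance, gain, corrected state."""
    y = np.asarray(z, dtype=float) - h(x_pred)
    if angle_index is not None:
        # Wrap the angular residual into [-pi, pi) so that 359 deg vs 1 deg is small.
        y[angle_index] = (y[angle_index] + np.pi) % (2.0 * np.pi) - np.pi
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x = x_pred + K @ y
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P


# Position-only measurement, as in the lidar example: H drops the velocity terms.
H_pos = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
x_new, P_new = ekf_update(np.zeros(4), np.eye(4), [2.0, -2.0],
                          lambda s: H_pos @ s, H_pos, np.eye(2))
```

For the radar case, `h` and `H` would be the nonlinear measurement function and its Jacobian, with `angle_index=1` for the azimuth component.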

[0139] Step 2: Camera parameter calibration: As shown in Figure 5, the camera's intrinsic and extrinsic parameters are obtained by taking multiple images of the calibration board and applying Zhang Zhengyou's chessboard calibration method. The chessboard is photographed from different angles and distances to obtain the camera's intrinsic and extrinsic parameters and to transform the camera coordinate system to the image coordinate system and pixel coordinate system. The obtained camera intrinsic and extrinsic parameters and scaling factor are shown in the following formulas:

[0140]

[0141] Step 3: Radar and Image Spatial Correlation: Using the intrinsic and extrinsic parameter matrices obtained in Step 2, preliminary camera calibration is completed. The purpose is to achieve the transformation between the image coordinate system and the world coordinate system. The world coordinate system is established with the camera as the origin, as shown in Figure 4. The 0° direction of the camera is chosen as the y-axis of the world coordinate system, and the direction 90° clockwise from the y-axis is chosen as the x-axis of the world coordinate system. The Zhang Zhengyou calibration method is used to obtain the ratio of image coordinates to world coordinates and the camera's intrinsic and extrinsic parameter matrices. Therefore, the correspondence between the image coordinate system and the world coordinate system can be obtained, and their relationship is described by the following equation:

[0142] $$Z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = H \begin{pmatrix} U \\ V \\ 1 \end{pmatrix}$$

[0143] Where H represents the product of intrinsic and extrinsic parameter matrices, Z represents the scaling factor (the ratio of pixel coordinates to world coordinates), u and v represent pixel coordinates, and U and V represent the corresponding world coordinates. The world coordinates corresponding to the pixel coordinates can be obtained through coordinate transformation. This completes the transformation between the pixel coordinate system and the world coordinate system. The radar coordinate system and the world coordinate system lie on the same plane, with the radar's position as the origin, the 0° direction facing the radar as the Y-axis, and the direction perpendicular to the radar's front as the X-axis, consistent with the world coordinate system. Radar detection does not provide information about the receiving height, which increases the difficulty of data fusion. It is assumed that the three-dimensional coordinates of the radar detection are returned from the ground where the vehicle is traveling. The projection is then extended in a direction perpendicular to this plane to account for the vertical extension of the object being detected. This invention assumes that the radar detection height extension is 3 meters, and the horizontal width of the object is assumed to be 2 meters.
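The pixel-to-world relationship described above can be sketched under the assumption that H is a 3×3 matrix acting on homogeneous ground-plane coordinates (U, V, 1), with Z recovered as the third component of the product. The function names, the example matrix, and the radar mounting offset are hypothetical.

```python
import numpy as np


def world_to_pixel(H, U, V):
    """Project ground-plane world coordinates (U, V) to pixel coordinates (u, v)."""
    p = H @ np.array([U, V, 1.0])
    Z = p[2]  # scaling factor
    return p[0] / Z, p[1] / Z


def pixel_to_world(H, u, v):
    """Invert the mapping: recover world coordinates from pixel coordinates."""
    q = np.linalg.inv(H) @ np.array([u, v, 1.0])
    return q[0] / q[2], q[1] / q[2]


def radar_to_world(x_r, y_r, radar_origin):
    """Shift a radar point into the world frame.

    Assumes the radar and world axes are parallel, so only the radar's
    position in the world frame (radar_origin, a hypothetical value) matters.
    """
    return x_r + radar_origin[0], y_r + radar_origin[1]


# Hypothetical example matrix, not a calibrated value.
H_example = np.array([[800.0, 0.0, 640.0],
                      [0.0, -200.0, 700.0],
                      [0.0, 0.5, 1.0]])
u, v = world_to_pixel(H_example, 2.0, 10.0)
```

A radar point would first pass through `radar_to_world` and then `world_to_pixel` to land on the image.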

[0144] Step 4: Radar and Image Time Correlation: Because radar and cameras are heterogeneous sensors, their data acquisition times differ. For example, in the case of Xiluo Circuit, the millimeter-wave radar acquires data at 20 Hz, while the Hikvision camera acquires image data at 25 Hz. The radar data includes the two-dimensional position coordinates of each object, its velocities Vx and Vy in the x and y directions, and its ID, while the camera only provides image information. To maintain consistency between the radar data sampling frequency and the video frames, it is assumed that the radar target's velocity remains constant over a short period.

[0145] The position information of each radar target point is interpolated to resample the radar data. Interpolation is performed between two consecutive radar data points, which are 50 ms apart. Within such a short time, the vehicle speed changes very little. Therefore, as shown in Figure 3, this invention selects a linear interpolation method. Assuming the radar x-coordinates at t = 50 ms and t = 100 ms are known (the y-coordinate and velocity are interpolated in the same way as the x-coordinate), the following interpolation relationship is obtained, where $x_{50}$ and $x_{100}$ are the x-coordinates at those two times and t is in milliseconds:

[0146] $$x(t) = x_{50} + \frac{t - 50}{100 - 50}\,(x_{100} - x_{50}), \quad 50 \le t \le 100$$
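The resampling from the 20 Hz radar rate to the 25 Hz camera rate can be sketched with NumPy's `numpy.interp`, which performs piecewise-linear interpolation. The track values and the helper name `resample_track` are made up for illustration.

```python
import numpy as np


def resample_track(radar_t_ms, radar_x, camera_t_ms):
    """Linearly interpolate one radar coordinate at the camera frame times."""
    return np.interp(camera_t_ms, radar_t_ms, radar_x)


# Radar samples every 50 ms (20 Hz); camera frames every 40 ms (25 Hz).
radar_t = np.arange(0, 201, 50)                     # 0, 50, 100, 150, 200 ms
radar_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])       # hypothetical x-coordinates
camera_t = np.arange(0, 201, 40)                    # 0, 40, 80, 120, 160, 200 ms
x_at_frames = resample_track(radar_t, radar_x, camera_t)
```

The same call would be repeated for the y-coordinate and the velocity components of each tracked ID.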

[0147] Step 5: Generate Region of Interest (ROI): After projecting radar points onto the image, the pixels near the radar points are selected as the ROI. The generated ROI will serve as the focus area for self-attention and cross-attention. The ROI generated in this step facilitates the implementation of the attention-based object detection method in Step 6.

[0148] In the method for generating regions of interest provided in this embodiment of the invention, steps 1 to 4 associate the radar data with camera pixels. Each radar point is mapped onto the image plane as a region 2 meters wide and 3 meters high. The projected area is the region of interest, which is used as the input to the decoder in the Transformer. This completes the spatial association between the radar and image data.
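A minimal sketch of the ROI generation follows, assuming a full 3×4 camera projection matrix is available so that points above the ground plane can be projected. The matrix `proj_example`, the function name `radar_roi`, and the image size are hypothetical; the 2 m width and 3 m height come from the text.

```python
import numpy as np


def radar_roi(proj_matrix, U, V, width=2.0, height=3.0, image_size=(1920, 1080)):
    """Project a width x height upright rectangle at ground point (U, V) to a pixel box."""
    half = width / 2.0
    # Homogeneous world corners: x across, y forward, z up from the ground.
    corners = np.array([
        [U - half, V, 0.0, 1.0],
        [U + half, V, 0.0, 1.0],
        [U - half, V, height, 1.0],
        [U + half, V, height, 1.0],
    ])
    proj = corners @ proj_matrix.T          # shape (4, 3)
    uv = proj[:, :2] / proj[:, 2:3]         # divide by the scaling factor
    w, h = image_size
    u_min, v_min = np.clip(uv.min(axis=0), 0, [w - 1, h - 1])
    u_max, v_max = np.clip(uv.max(axis=0), 0, [w - 1, h - 1])
    return u_min, v_min, u_max, v_max


# Hypothetical camera: focal length 1000 px, principal point (960, 540),
# mounted 1.5 m above the ground and looking along the world y-axis.
proj_example = np.array([[1000.0, 960.0, 0.0, 0.0],
                         [0.0, 540.0, -1000.0, 1500.0],
                         [0.0, 1.0, 0.0, 0.0]])
roi = radar_roi(proj_example, 0.0, 20.0)
```

The resulting box shrinks with distance, so far targets produce smaller ROIs than near ones.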

[0149] Step 6: Attention-based target detection method: After the radar target tracking, radar point and image temporal and spatial alignment and generation of region of interest in the first five steps, feature maps of different dimensions and scales are generated by the ResNet-50 backbone convolutional neural network. The feature maps are translated into a feature map sequence and input into the encoder to extract features from the feature map sequence. Self-attention and cross-attention are learned in the decoder. The region of interest plays a role in detection guidance in the decoder. Self-attention processing is performed on the region of interest part in the feature map sequence to obtain the final target detection result.

[0150] The specific steps of the attention-based target detection method provided in this embodiment of the invention are as follows:

[0151] (1) Generate multi-scale feature maps

[0152] After fusing radar and images, feature maps are extracted using a ResNet-50 backbone convolutional neural network. By obtaining multi-scale feature maps based on different dimensional inputs, multi-scale information can be obtained, which can improve the detection accuracy of small targets.

[0153] (2) Self-attention mechanism based on radar points

[0154] The self-attention mechanism learns the relationships between pixels near the radar projection point, focusing the detector's attention on the vicinity and local area of the radar point. This allows for the rapid capture of useful information near the radar projection point, improving target detection accuracy while avoiding detection overhead and waste, thus increasing detection efficiency. In other words, the self-attention mechanism based on radar projection points utilizes the features of the radar point's location to help detect the image's category and location information, making the detected image's location and category accuracy more robust.
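The idea of focusing self-attention on the neighborhood of radar projection points can be illustrated with a masked scaled dot-product attention. This is a simplified sketch, not the embodiment's network: the learned query, key, and value projections are omitted, and `roi_mask` is a hypothetical boolean flag marking feature tokens that fall inside an ROI.

```python
import numpy as np


def roi_self_attention(features, roi_mask):
    """Self-attention where only ROI tokens act as keys and only ROI tokens are updated.

    features: (n_tokens, d) array; roi_mask: (n_tokens,) bool with at least one True.
    """
    d = features.shape[1]
    scores = features @ features.T / np.sqrt(d)
    # Keys outside the ROI receive zero attention weight.
    scores = np.where(roi_mask[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    out = features.copy()
    out[roi_mask] = (weights @ features)[roi_mask]
    return out, weights
```

Restricting the keys to ROI tokens reduces the number of token pairs that contribute, which is one way to read the efficiency argument in the paragraph above.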

[0155] (3) Multi-head cross-attention mechanism based on radar points

[0156] Self-attention mechanisms learn the relationships between pixels near radar projection points, while cross-attention captures reasoning information about the global context of the radar projection points. In traditional attention-based object detection, the decoder's position and category queries are selected dynamically and randomly, which greatly reduces detection efficiency and accuracy. However, the region of interest generated by the radar projection points in the image can be used to assist the decoder's position and category queries, making the decoder pay more attention to the region of interest generated by the radar projection points for category and position detection. This improves the robustness and accuracy of object detection and speeds up the model's convergence.

[0157] (4) Bipartite graph matching loss function based on Hungarian algorithm

[0158] The decoder predicts a fixed-size set of N predictions in one pass, where N is set to be significantly larger than the typical number of objects in the image. Assuming the radar projection onto the image generates m regions of interest (ROIs), N − m random predictions are generated based on these m ROIs to complete the N predictions. Finally, the decoder outputs N predictions, each containing a 2-tuple (c, box) representing the detected class and location, where c represents the object class and box represents the location of the detected bounding box in the image. Then, the Hungarian algorithm is used for bipartite graph matching, which maps each element of the prediction set to the ground truth set, calculates the loss between each prediction and the ground truth set, and minimizes the total matching loss.

[0159] Since each detection result consists of a class and location pair, the loss value for each prediction result, as shown in the following equations, comprises two parts, a class loss and a location loss, and is a linear combination of the two. Finally, as described in the equations, the optimal assignment over all predictions is found using the Hungarian algorithm.

[0160]

[0161]
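The bipartite matching step can be sketched with SciPy's `scipy.optimize.linear_sum_assignment`, which solves the assignment problem on a cost matrix. The cost used here (negative class probability plus a weighted L1 box distance) and the weight `box_weight` are assumptions for illustration; the exact loss terms of the embodiment are not reproduced.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_predictions(pred_probs, pred_boxes, gt_classes, gt_boxes, box_weight=5.0):
    """Match N predictions to M ground-truth objects with minimum total cost.

    pred_probs: (N, n_classes), pred_boxes: (N, 4),
    gt_classes: (M,) int, gt_boxes: (M, 4).
    """
    # Class cost: higher predicted probability of the true class means lower cost.
    class_cost = -pred_probs[:, gt_classes]                                   # (N, M)
    # Location cost: L1 distance between predicted and ground-truth boxes.
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(axis=2)
    cost = class_cost + box_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx, cost[pred_idx, gt_idx].sum()


# Three predictions (N = 3) against two ground-truth objects (M = 2).
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
boxes = np.array([[0.1, 0.1, 0.1, 0.1], [0.5, 0.5, 0.2, 0.2], [0.9, 0.9, 0.3, 0.3]])
gt_c = np.array([1, 0])
gt_b = np.array([[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1]])
pred_idx, gt_idx, total_cost = match_predictions(probs, boxes, gt_c, gt_b)
```

Predictions left unmatched (here the third one) would be treated as the empty class when the training loss is computed.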

[0162] Step 7: Attention-based multi-sensor fusion target detection system: As shown in Figure 6, this embodiment of the invention designs and implements a smart sentinel security system based on the proposed attention-based multi-sensor fusion target detection method. This smart sentinel security system is mainly designed for security scenarios. It uses radar and cameras as data inputs to achieve intrusion detection and real-time alarm functions in designated areas.

[0163] This invention integrates radar and video surveillance information on the speed, distance, azimuth, and direction of movement of targets within the monitored area. It intelligently drives a high-definition, high-speed network intelligent PTZ camera to perform real-time dynamic tracking and intelligent zoom-based clear capture and verification of intrusion targets. Simultaneously, it proactively issues real-time audible and visual alarm prompts, clearly displays the target's movement trajectory and intrusion scene, and enables the radar to automatically detect and identify suspicious intrusion targets, automatically activate cameras to track and monitor suspicious targets, and automatically generate audible and visual alarms and SMS alarms. This significantly improves the work efficiency of surveillance personnel and allows for target detection and alarm processing of pedestrians within the monitored area.

[0164] It should be noted that embodiments of the present invention can be implemented in hardware, software, or a combination of both. The hardware portion can be implemented using dedicated logic; the software portion can be stored in memory and executed by a suitable instruction execution system, such as a microprocessor or dedicated-design hardware. Those skilled in the art will understand that the above-described devices and methods can be implemented using computer-executable instructions and / or included in processor control code, for example, such code provided on a carrier medium such as a disk, CD, or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The devices and modules of the present invention can be implemented by hardware circuitry such as very large-scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field-programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of the above-described hardware circuitry and software, such as firmware.

[0165] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modifications, equivalent substitutions, and improvements made by those skilled in the art within the scope of the technology disclosed in the present invention, and within the spirit and principles of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A multi-sensor fusion target detection method, characterized in that the multi-sensor fusion target detection method includes: using radar point projection and dual image channels, employing the Kalman filter algorithm to correlate radar target frames, using the ResNet-50 backbone as the feature extraction network for the feature map, and using the Zhang Zhengyou calibration method to calibrate the camera's intrinsic and extrinsic parameters; achieving spatial correlation by projecting radar points onto the image through coordinate system transformation, and achieving temporal correlation using Lagrange interpolation; and obtaining the final target detection result by performing self-attention and cross-attention processing on the regions of interest generated by the radar points.

The multi-sensor fusion target detection method includes the following steps:

Step 1: Construct a radar target tracking algorithm based on the extended Kalman filter algorithm, and establish relevant motion state prediction and update equations based on the radar's state transition matrix and noise parameters;

Step 2, camera parameter calibration: Match the points in the world coordinate system with the points in the pixel coordinate system to obtain the camera's intrinsic and extrinsic parameters, which are used for conversion between the camera coordinate system and the world coordinate system;

Step 3, spatial association between radar and image: Using the intrinsic and extrinsic parameter matrices, the world coordinate system is transformed to the image coordinate system, and the radar points are transformed to the image coordinate system, thus realizing the spatial association between radar points and the image;

Step 4, radar and image temporal correlation: The millimeter-wave radar outputs detection results at 20 Hz in the two-dimensional coordinate system, and Lagrange interpolation is performed on each vehicle trajectory to resample the radar data;

Step 5, region of interest generation: After projecting the radar points onto the image, the pixels near the radar points are taken as the region of interest, and the region of interest is taken as the focus area for self-attention and cross-attention;

Step 6, attention-based target detection: Feature extraction is performed using a ResNet-50 backbone convolutional neural network to generate feature maps, and the final target detection result is obtained through self-attention and cross-attention processing.

In step six, after radar target tracking, temporal and spatial alignment of radar points and images, and generation of regions of interest, feature maps of different dimensions and scales are generated through a ResNet-50 backbone convolutional neural network. The feature maps are translated into a sequence of feature maps, which is then input into the encoder for feature extraction. Self-attention and cross-attention are learned in the decoder. Self-attention processing is applied to the regions of interest in the feature map sequence to obtain the final target detection result.

The decoder predicts a fixed-size set of N predictions, where N is set to be significantly larger than the typical number of objects in the image. When the radar points are projected onto the image, m regions of interest (ROIs) are generated, and N − m random predictions are generated based on these m ROIs. The decoder outputs N prediction results, each containing a tuple (c, box) representing the detected category and location, where c represents the object category and box represents the location of the detected bounding box in the image. The Hungarian algorithm is used for bipartite graph matching, mapping elements between the prediction and ground truth sets, calculating the loss between each prediction and the ground truth result, and minimizing the total matching loss.

Each detection result consists of a category and location tuple, and the loss value for each prediction result comprises a category loss and a box position loss, forming a linear combination of the two losses. Finally, the Hungarian algorithm is used to match all optimal solutions. In the expressions for the detection result, the loss value for each prediction result is composed of two parts, the category loss and the box position loss, computed between the actual value and the predicted value, where the predicted value includes the predicted location of the target. An indicator term tests whether the identified target category is an empty set; it takes the value 1 when the condition is met and 0 otherwise. The optimal loss is then calculated by matching all target values using the Hungarian algorithm.

2. The multi-sensor fusion target detection method of claim 1, wherein the radar target tracking algorithm based on the extended Kalman filter algorithm in step one includes:

(1) Derive the radar measurement function: ρ represents the distance from the radar to the obstacle, and φ is the obstacle's azimuth angle; φ is positive when rotating counterclockwise from the x-axis, but the actual measured angle φ is negative; the radial velocity $\dot{\rho}$ is the projection of the velocity v onto the radar line. The radar data is then processed to relate polar coordinates to Cartesian coordinates. With the target position $(p_x, p_y)$ and velocity $(v_x, v_y)$, the conversion formulas are as follows. The distance ρ from the radar to the obstacle is defined as: $\rho = \sqrt{p_x^2 + p_y^2}$; the angle φ between ρ and the x-direction is defined as: $\varphi = \arctan(p_y / p_x)$; the radial velocity $\dot{\rho}$ is then defined as: $\dot{\rho} = (p_x v_x + p_y v_y) / \sqrt{p_x^2 + p_y^2}$; the measurement function obtained for the millimeter-wave radar is: $h(x) = (\rho, \varphi, \dot{\rho})^T$;

(2) Construct the Jacobian matrix of the extended Kalman filter: $H_j = \partial h / \partial x$, with $x = (p_x, p_y, v_x, v_y)^T$;

(3) Predict the motion state of the radar: When the current position and velocity of the radar target are x, and the object maintains the same velocity while moving, the prediction equations are as follows, calculated using the function x′ = Fx + ν: $x' = F x + \nu$; $P' = F P F^{T} + Q$;

(4) Update the radar motion status: By using the measurement function to map the state vector to the sensor's measurement space, and comparing the lidar's measured position with the predicted object position, the lidar update equations are as follows: $y = z - h(x')$; $S = H P' H^{T} + R$; $K = P' H^{T} S^{-1}$; $x = x' + K y$; $P = (I - K H) P'$.

3. The multi-sensor fusion target detection method of claim 1, wherein step two, camera parameter calibration, includes: obtaining the camera's intrinsic and extrinsic parameters by taking multiple images of the calibration board and using Zhang Zhengyou's chessboard calibration method; taking photos of the chessboard grid from different angles and distances to obtain the camera's intrinsic and extrinsic parameters, and transforming the camera coordinate system to the image coordinate system and pixel coordinate system to obtain the camera's intrinsic and extrinsic parameters and scaling factor, as shown in the following formula:

4. The multi-sensor fusion target detection method as described in claim 1, characterized in that, by using the intrinsic and extrinsic parameter matrices obtained in step two, the initial camera calibration is completed, and the transformation between the image coordinate system and the world coordinate system is realized. The world coordinate system is established with the camera as the origin. After transforming the radar relative coordinate system to the world coordinate system, the radar points are then transformed to the image coordinate system. Choosing the 0° direction of the camera as the y-axis of the world coordinate system, and the direction 90° clockwise from the y-axis as the x-axis of the world coordinate system, the Zhang Zhengyou calibration method is used to obtain the ratio of image coordinates to world coordinates and the camera's intrinsic and extrinsic parameter matrices. The correspondence between the image coordinate system and the world coordinate system is then obtained, described by the following equation: $Z\,(u, v, 1)^T = H\,(U, V, 1)^T$; where H represents the product of intrinsic and extrinsic parameter matrices, Z represents the scaling factor between pixel coordinates and world coordinates, u and v represent pixel coordinates, and U and V represent the corresponding world coordinates. The world coordinates corresponding to the pixel coordinates are obtained through coordinate transformation, thus completing the transformation between the pixel coordinate system and the world coordinate system.

5. A multi-sensor fusion target detection system applying the multi-sensor fusion target detection method as described in any one of claims 1 to 4, characterized in that, The multi-sensor fusion target detection system includes: The radar target frame association module is used to associate radar target frames using a Kalman filter algorithm through dual channels of radar point projection and visual image. The camera intrinsic and extrinsic parameter calibration module is used to calibrate the camera intrinsic and extrinsic parameters using the Zhang Zhengyou calibration method, which is a feature extraction network that uses the Resnet-50 backbone as the feature map. The spatial / temporal correlation module is used to achieve spatial correlation by projecting radar points onto the image through coordinate system transformation, and to achieve temporal correlation using Lagrange interpolation. The fusion target detection module is used to obtain the final target detection result by performing self-attention and cross-attention processing on the region of interest generated by radar points.

6. A computer device, comprising: The computer device includes a memory and a processor. The memory stores a computer program, which, when executed by the processor, causes the processor to perform the steps of the multi-sensor fusion target detection method as described in any one of claims 1 to 4.

7. A computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the multi-sensor fusion target detection method as described in any one of claims 1 to 4.

8. An information data processing terminal, characterized by The information data processing terminal is used to implement the multi-sensor fusion target detection system as described in claim 5.

Citation Information

Patent Citations

Millimeter wave radar and vision fused three-dimensional target detection method based on attention mechanism
CN114708585A
Automatic driving target detection and tracking method based on multi-source heterogeneous information fusion
CN115471526A
