Training method for a target detection model, target detection method, and apparatus

By fusing features acquired by image acquisition devices and remote sensing devices, the problem of feature misalignment between sensors is solved, thereby improving the detection accuracy and precision of the target detection model.

CN117152560B · Active · Publication Date: 2026-06-26 · JIUZHI (SUZHOU) INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: JIUZHI (SUZHOU) INTELLIGENT TECH CO LTD
Filing Date: 2023-09-06
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing target detection models based on point cloud and image fusion suffer from feature misalignment due to biases and errors between sensors, affecting the accuracy of target detection results.

Method used

The dense features of multi-view images and the sparse features of point cloud data are determined by image acquisition equipment and remote sensing equipment, respectively. The features are then fused to form sample fused image features, which are used to train the target detection model.

Benefits of technology

It strengthens feature fusion and thereby improves the accuracy of the target detection model's results.



Abstract

The application discloses a training method for a target detection model, a target detection method, and corresponding apparatus, belonging to the field of autonomous driving. The method includes: determining sample image dense features from sample multi-view images of the sample scene where the autonomous vehicle is located, collected by an image acquisition device; determining sample image sparse features from sample point cloud data of the same sample scene, collected by a remote sensing detection device; fusing the sample image dense features and the sample image sparse features to obtain sample fused image features; and training the target detection model according to the sample fused image features. The application addresses the problem of feature misalignment, enhances the feature fusion effect, and thereby improves the accuracy of subsequent target detection results.

Description

Technical Field

[0001] This invention relates to the field of autonomous driving technology, and in particular to a training method for a target detection model, a target detection method, and corresponding apparatus.

Background Technology

[0002] 3D object detection enables intelligent agents to effectively perceive the real environment, classify and locate 3D objects, and has important application value in the field of autonomous driving.

[0003] However, existing target detection models based on point cloud and image fusion suffer from feature misalignment caused by deviations in the mutual projection between sensors on autonomous vehicles. Sources of such deviation include dynamic changes in the vehicle's own position, errors in the calibration relationship between sensors, and differences in triggering time between sensors. This misalignment affects the accuracy of subsequent target detection results.

Summary of the Invention

[0004] This invention provides a training method for a target detection model, a target detection method, and corresponding apparatus, which solve the problem of feature misalignment, enhance the feature fusion effect, and improve the accuracy of subsequent target detection results.

[0005] According to one aspect of the present invention, a method for training an object detection model is provided, the method comprising:

[0006] Based on the multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device, the dense features of the sample images are determined.

[0007] Based on the sample point cloud data of the sample scene where the autonomous vehicle is located, collected by remote sensing equipment, the sparse features of the sample image are determined.

[0008] The dense and sparse features of the sample images are fused to obtain the fused image features.

[0009] The target detection model is trained based on the features of the fused sample images.

[0010] According to another aspect of the present invention, a target detection method is provided, the method comprising:

[0011] Acquire multi-view images and point cloud data of the target in the target scene where the autonomous vehicle is located;

[0012] A target detection model is used to detect targets in multi-view images and point cloud data to obtain a second predicted target in the target scene; wherein the target detection model is trained based on the training method of the target detection model in any one of claims 1-5.

[0013] According to another aspect of the present invention, a training apparatus for an object detection model is provided, the apparatus comprising:

[0014] The image dense feature determination module is used to determine the dense features of the sample images based on the multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device.

[0015] The image sparse feature determination module is used to determine the sparse features of the sample image based on the sample point cloud data in the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment.

[0016] The fused image feature determination module is used to fuse dense features and sparse features of sample images to obtain fused image features.

[0017] The object detection model training module is used to train the object detection model based on the features of the fused image samples.

[0018] According to another aspect of the present invention, a target detection device is provided, the device comprising:

[0019] The data acquisition module is used to acquire multi-view images and point cloud data of the target in the target scene where the autonomous vehicle is located;

[0020] The target prediction module is used to perform target detection on multi-view images and point cloud data of the target using a target detection model to obtain a second predicted target in the target scene; wherein the target detection model is trained based on the training method of the target detection model in any one of claims 1-5.

[0021] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising:

[0022] At least one processor; and

[0023] A memory that is communicatively connected to at least one processor; wherein,

[0024] The memory stores a computer program that can be executed by at least one processor, such that the at least one processor is able to execute a training method for a target detection model according to any embodiment of the present invention, and / or a target detection method.

[0025] According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to execute and implement a training method for a target detection model of any embodiment of the present invention, and / or a target detection method.

[0026] The technical solution of this invention determines dense features of sample images based on multi-view images of the sample scene where the autonomous vehicle is located, acquired by an image acquisition device; and determines sparse features of sample images based on point cloud data of the sample scene where the autonomous vehicle is located, acquired by a remote sensing device. The dense and sparse features of the sample images are then fused to obtain fused image features. A target detection model is trained based on these fused image features. This technical solution, by fusing dense and sparse features of sample images from the same viewpoint and dimension, solves the problem of feature misalignment in the determination process of fused image features, enhances the feature fusion effect, and results in higher target detection accuracy of the target detection model trained based on the fused image features, thereby improving the accuracy of subsequent target detection results.

[0027] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Description of the Drawings

[0028] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0029] Figure 1 is a flowchart of a training method for an object detection model according to Embodiment 1 of the present invention;

[0030] Figure 2 is a flowchart of a training method for an object detection model according to Embodiment 2 of the present invention;

[0031] Figure 3 is a flowchart of a target detection method according to Embodiment 3 of the present invention;

[0032] Figure 4 is a schematic diagram of the structure of a training device for a target detection model according to Embodiment 4 of the present invention;

[0033] Figure 5 is a schematic diagram of the structure of a target detection device according to Embodiment 5 of the present invention;

[0034] Figure 6 is a schematic diagram of the structure of an electronic device that implements the training method for the target detection model and / or the target detection method of the embodiments of the present invention.

Detailed Implementation

[0035] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0036] It should be noted that the terms "target," "sample," "first," and "second," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0037] Furthermore, it should be noted that the collection, storage, use, processing, transmission, provision, and disclosure of sample multi-view images, sample point cloud data, target multi-view images, target point cloud data, and supervision data in the sample scene involved in the technical solution of the present invention all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0038] Embodiment 1

[0039] Figure 1 is a flowchart of a training method for an object detection model provided in Embodiment 1 of the present invention. This embodiment is applicable to optimizing object detection models used in autonomous driving scenarios. The method can be executed by a training device for the object detection model, which can be implemented in hardware and / or software and can be configured in an electronic device. As shown in Figure 1, the method includes:

[0040] S101. Based on the multi-view images of the sample scene where the autonomous vehicle is located, collected by the image acquisition device, determine the dense features of the sample images.

[0041] In this context, "image acquisition device" refers to a device used to acquire images of the scene where the autonomous vehicle is located; it can be a wide-angle camera or an infrared camera. "Sample scene" refers to the autonomous driving scene required for training the object detection model; for example, a sample scene could be a road intersection. "Sample multi-view images" refers to multi-view images required for training the object detection model. Multi-view images refer to images acquired from different perspectives at the same time; for example, multi-view images could be images acquired simultaneously from the front, rear, left, and right perspectives of the autonomous vehicle. It should be noted that the number of image acquisition devices can be preset according to actual business needs; for example, the number of image acquisition devices could be 4 or 6. This embodiment of the invention does not impose a specific limitation. Correspondingly, the installation location of the image acquisition devices on the autonomous vehicle can also be preset according to actual business needs; this embodiment of the invention does not impose a specific limitation, but it must ensure 360-degree full coverage of the scene where the autonomous vehicle is located. "Sample image dense features" refers to the image dense features required for training the object detection model. Image dense features refer to the image features obtained after processing the images acquired by the image acquisition devices. It should be noted that object detection models refer to models used to detect objects of interest in the scene where autonomous vehicles are located, and to determine the category and location of the objects of interest.

[0042] Specifically, multi-view images of the sample scene where the autonomous vehicle is located, acquired by the image acquisition device, can be input into the feature extraction network. After processing by the feature extraction network, dense features in the multi-view images of the samples can be obtained. Then, the dense features are projected onto the radar coordinate system to obtain the dense features of the three-dimensional image. Furthermore, convolutional coding processing is performed on the dense features of the three-dimensional image to obtain the dense features of the sample image.

[0043] S102. Based on the sample point cloud data of the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment, determine the sparse features of the sample image.

[0044] In this context, "remote sensing equipment" refers to devices used for long-distance detection of the scene where autonomous vehicles are located, and can be lidar. "Sample point cloud data" refers to the point cloud data required for training the target detection model. Specifically, point cloud data refers to the data collected by the remote sensing equipment. "Sample image sparse features" refers to the image sparse features required for training the target detection model. Specifically, image sparse features refer to the image features obtained after processing the point cloud data collected by the remote sensing equipment.

[0045] Specifically, the point cloud data of the autonomous vehicle in the sample scene collected by the remote sensing equipment can be voxelized to obtain the voxel features of the point cloud data; then, the voxel features can be convolutionally encoded to obtain the sparse features of the sample image.
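The voxelization step in [0045] can be illustrated with a minimal sketch. The patent does not specify the voxel size, the per-point attributes, or how points within a voxel are aggregated; the function name `voxelize`, the `(x, y, z, intensity)` point layout, and mean pooling are all assumptions made here for illustration only.

```python
from collections import defaultdict

def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group (x, y, z, intensity) points into voxels and average each group.

    Returns a dict mapping integer voxel indices (i, j, k) to a mean feature
    vector [x, y, z, intensity]. Only occupied voxels are stored, which is
    why the result is sparse. Mean pooling is an assumed aggregation rule.
    """
    buckets = defaultdict(list)
    for x, y, z, intensity in points:
        index = (
            int((x - origin[0]) // voxel_size[0]),
            int((y - origin[1]) // voxel_size[1]),
            int((z - origin[2]) // voxel_size[2]),
        )
        buckets[index].append((x, y, z, intensity))

    voxels = {}
    for index, members in buckets.items():
        count = len(members)
        voxels[index] = [sum(m[c] for m in members) / count for c in range(4)]
    return voxels

# Hypothetical toy point cloud: two points share a voxel, one lies further along x.
points = [
    (0.1, 0.1, 0.1, 0.5),
    (0.3, 0.2, 0.1, 0.7),
    (1.2, 0.1, 0.1, 0.2),
]
voxel_features = voxelize(points, voxel_size=(0.5, 0.5, 0.5))
```

In a full pipeline these per-voxel vectors would then be passed to the convolutional encoder mentioned in [0045]; that encoder is not sketched here.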

[0046] S103. Fuse the dense features and sparse features of the sample image to obtain the sample fused image features.

[0047] Here, the sample fused image features refer to the image features obtained by fusing the dense features and sparse features of the sample images.

[0048] Specifically, a fusion algorithm based on a convolutional neural network can be used to fuse the dense features and sparse features of the sample images to obtain the sample fused image features.

[0049] Understandably, fusing dense and sparse features of sample images allows the resulting fused image features to contain both the rich semantic information from the dense features and the rich depth information from the sparse features, thus enabling full utilization of multi-view images and point cloud data from the sample scene where the autonomous vehicle is located.
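One common way to realize a convolution-based fusion such as the one in [0048] is to concatenate the two feature maps channel-wise and mix the channels with a 1x1 convolution. The patent does not say which fusion network is used, so the sketch below is only one possible reading; `fuse_bev`, the nested-list layout, the ReLU activation, and the example weights are all hypothetical.

```python
def fuse_bev(dense, sparse, weights, bias):
    """Concatenate two BEV feature maps per cell and mix channels (a 1x1 conv).

    dense:   H x W x Cd nested lists (camera branch)
    sparse:  H x W x Cs nested lists (point cloud branch, zeros where empty)
    weights: Cout x (Cd + Cs) nested lists, bias: list of length Cout
    Both maps must already share the same grid, i.e. the same view and size.
    """
    fused = []
    for dense_row, sparse_row in zip(dense, sparse):
        fused_row = []
        for d_cell, s_cell in zip(dense_row, sparse_row):
            stacked = d_cell + s_cell  # channel concatenation
            out = [
                max(0.0, sum(w * v for w, v in zip(w_row, stacked)) + b)  # ReLU
                for w_row, b in zip(weights, bias)
            ]
            fused_row.append(out)
        fused.append(fused_row)
    return fused

# Hypothetical 1 x 2 grid: 2 dense channels, 1 sparse channel, 2 output channels.
dense = [[[1.0, 2.0], [0.5, 0.0]]]
sparse = [[[3.0], [0.0]]]
weights = [[1.0, 0.0, 1.0], [0.0, -1.0, 0.5]]
bias = [0.0, 0.0]
fused = fuse_bev(dense, sparse, weights, bias)
```

The per-cell concatenation only makes sense because both inputs are indexed by the same grid, which is the alignment property the technical solution relies on.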

[0050] S104. Train the target detection model based on the features of the sample fusion image.

[0051] Optionally, a target prediction network in the target detection model can be used to predict the target based on the sample fused image features to obtain the first predicted target in the sample scene; the target detection model can then be trained based on the first predicted target and the supervised data in the sample scene.

[0052] The object prediction network refers to the prediction network in the object detection model, used to detect objects of interest in the sample scene where the autonomous vehicle is located; optionally, the object prediction network may include a detection head and a segmentation head. The first predicted target refers to the predicted target in the sample scene. Supervised data refers to labeled data in the sample scene.

[0053] Specifically, the fused image features of the samples can be input into the target prediction network of the target detection model. After processing by the target prediction network, the first predicted target in the sample scene is obtained. The target detection model is then jointly trained using the first predicted target and the supervised data in the sample scene until the training loss reaches a set range or the number of training iterations reaches a set number. At this point, training of the target detection model is stopped, and the target detection model at the point of training stoppage is taken as the final target detection model. The set range and the set number of iterations can be preset according to actual business needs, and this embodiment of the invention does not impose specific limitations on them.
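The stopping rule in [0053] (stop when the training loss reaches a set range or the iteration count reaches a set number) can be sketched on a toy problem. The model below is a single scalar weight fitted by gradient descent, standing in for the target detection model; the loss function, learning rate, threshold, and iteration cap are all assumed values, not taken from the patent.

```python
def train(samples, labels, lr=0.1, loss_threshold=1e-4, max_iters=1000):
    """Fit y = w * x by gradient descent; stop on low loss or iteration cap."""
    w = 0.0
    loss = float("inf")
    iters = 0
    while iters < max_iters:
        preds = [w * x for x in samples]
        # Mean squared error between predictions and supervised labels.
        loss = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(samples)
        if loss <= loss_threshold:
            break  # loss has reached the set range
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, labels, samples)) / len(samples)
        w -= lr * grad
        iters += 1
    return w, loss, iters

# Hypothetical supervised data generated by y = 2 * x.
w, final_loss, iters_used = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Whichever condition triggers first ends training, and the parameters at that point are kept as the final model, matching the description above.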

[0054] Understandably, training the object detection model in a semi-supervised manner, using the unlabeled first predicted targets together with the labeled supervised data, reduces the model's need for labeled data and improves its detection capability.

[0055] The technical solution of this invention determines dense features of sample images based on multi-view images of the sample scene where the autonomous vehicle is located, acquired by an image acquisition device; and determines sparse features of sample images based on point cloud data of the sample scene where the autonomous vehicle is located, acquired by a remote sensing device. The dense and sparse features of the sample images are then fused to obtain fused image features. A target detection model is trained based on these fused image features. This technical solution, by fusing dense and sparse features of sample images from the same viewpoint and dimension, solves the problem of feature misalignment in the determination process of fused image features, enhances the feature fusion effect, and results in higher target detection accuracy of the target detection model trained based on the fused image features, thereby improving the accuracy of subsequent target detection results.

[0056] Embodiment 2

[0057] Figure 2 is a flowchart of a training method for a target detection model provided in Embodiment 2 of the present invention. Based on the above embodiments, this embodiment further optimizes the step of "determining the dense features of sample images based on multi-view images of samples in the sample scene where the autonomous vehicle is located, acquired by the image acquisition device," providing an optional implementation scheme. It should be noted that parts not detailed in this embodiment can be referred to in the relevant descriptions of other embodiments. As shown in Figure 2, the method includes:

[0058] S201. Extract features from the multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device, to obtain the first image features.

[0059] The first image feature refers to the image features obtained after feature extraction from the multi-view images of the sample.

[0060] Specifically, a feature pyramid network can be used to extract features from multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device, to obtain the first image features.

[0061] S202. Perform three-dimensional feature transformation on the first image features to obtain three-dimensional image features.

[0062] Here, the three-dimensional image features refer to the pseudo-voxel features of the first image features under the bird's eye view (BEV).

[0063] Specifically, a depth image classifier can be used to perform depth density prediction on the first image features to obtain the corresponding depth image features. Then, the extrinsic parameters of the image acquisition device (e.g., those of a wide-angle camera) and the depth image features are input into a 3D projection model. After processing by the 3D projection model, the 3D image features are obtained. The 3D projection model projects the first image features onto the radar coordinate system, ensuring that the first image features correspond to each point in the sample point cloud data.
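The transformation in [0063] resembles the widely used "lift and splat" idea: each pixel feature is spread along its camera ray according to a predicted depth distribution, moved into the radar coordinate system, and accumulated in a BEV grid. The sketch below is a simplified illustration under that assumption, not the patent's 3D projection model. The pinhole intrinsics `(fx, fy, cx, cy)`, the depth bins, the function names, and the example extrinsics are hypothetical; the patent itself only mentions extrinsic parameters.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(feature, depth_logits, depth_bins, pixel, intrinsics, cam_to_lidar):
    """Spread one pixel feature along its camera ray, weighted by depth probability.

    Returns a list of ((x, y, z) in the target frame, weighted feature) pairs.
    """
    u, v = pixel
    fx, fy, cx, cy = intrinsics
    rotation, translation = cam_to_lidar
    lifted = []
    for prob, depth in zip(softmax(depth_logits), depth_bins):
        # Back-project the pixel to a 3D point in the camera frame (pinhole model).
        cam_point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
        # Rigid transform into the target frame using the extrinsic parameters.
        world_point = tuple(
            sum(r * c for r, c in zip(row, cam_point)) + t
            for row, t in zip(rotation, translation)
        )
        lifted.append((world_point, [prob * f for f in feature]))
    return lifted

def splat_to_bev(lifted, cell_size, channels):
    """Sum lifted features into BEV cells indexed by (x, y) in the target frame."""
    bev = {}
    for (x, y, _z), feat in lifted:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        acc = bev.setdefault(cell, [0.0] * channels)
        for c in range(channels):
            acc[c] += feat[c]
    return bev

# Hypothetical extrinsics: camera z (forward) maps to x, camera x to -y, camera y to -z.
rotation = [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
extrinsics = (rotation, [0.0, 0.0, 0.0])
lifted = lift_pixel(
    feature=[2.0, 4.0],
    depth_logits=[0.0, 0.0],        # uniform belief over the two depth bins
    depth_bins=[1.5, 2.5],
    pixel=(320.0, 240.0),           # the principal point, so the ray is straight ahead
    intrinsics=(500.0, 500.0, 320.0, 240.0),
    cam_to_lidar=extrinsics,
)
bev = splat_to_bev(lifted, cell_size=1.0, channels=2)
```

Because the features end up indexed by position in the same frame as the point cloud, they can later be combined cell by cell with features derived from the point cloud.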

[0064] S203. Encode the three-dimensional image features to obtain the dense features of the sample image.

[0065] Specifically, convolutional coding is performed on the features of the three-dimensional image to obtain dense features of the sample image.

[0066] S204. Based on the sample point cloud data of the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment, determine the sparse features of the sample image.

[0067] Optionally, the sample point cloud data in the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment, is processed into voxels to obtain voxel features; the voxel features are fused with the first image features to obtain fused voxel features; and the fused voxel features are encoded to obtain sparse features of the sample image.

[0068] Here, the fused voxel feature refers to the voxel feature obtained by fusing the voxel feature and the first image feature.

[0069] Specifically, the point cloud data of the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment, is voxelized to obtain voxel features; based on a multimodal fusion algorithm, the voxel features are fused with the first image features to obtain fused voxel features; the fused voxel features are then convolutionally encoded to obtain sparse features of the sample image. The multimodal fusion algorithm can be pre-set according to actual business needs; for example, it can be an attention-based multimodal fusion algorithm, and this embodiment of the invention does not specifically limit it.

[0070] Optionally, the voxel features are fused with the first image features to obtain fused voxel features. This can be achieved by: determining the region image features corresponding to the voxel features from the first image features; performing a deviation operation on the region image features and the voxel features to obtain the deviation amount of the voxel features; and determining the fused voxel features based on the voxel features and the deviation amount.

[0071] Here, the region image feature refers to the image feature, within the first image feature, that corresponds to the projection area of the voxel feature in the camera coordinate system.

[0072] Specifically, for each voxel feature, based on the projection area of the voxel feature in the camera coordinate system, the image features in the projection area are extracted from the first image features as the region image features; based on a deformable convolutional network, the deviation operation is performed on the region image features and voxel features to obtain the deviation amount of the voxel features; a cross-attention network is used to fuse the voxel features and the deviation amount to obtain the fused voxel features.

[0073] It is understandable that determining the region image features corresponding to the voxel features from the first image features is to align the first image features and the voxel features; performing deviation calculations on the region image features and voxel features based on deformable convolutional networks to obtain the deviation amount of the voxel features is to extract more feature information; and using a cross-attention network to fuse the voxel features and the deviation amount to obtain fused voxel features is to enhance the fusion strength of the voxel features and the deviation amount.
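The three steps in [0070] to [0073] (find the region image features for a voxel, derive a deviation amount, fuse it with the voxel feature) can be sketched in a heavily simplified form. The code below does not implement a deformable convolutional network. As a stand-in, it uses plain scaled dot-product cross-attention of the voxel feature over a fixed window of image features and treats the attended result as the deviation, which is then added to the voxel feature as a residual. The window radius, the pinhole intrinsics, and all function names are assumptions made for illustration.

```python
import math

def project_to_pixel(point, intrinsics):
    """Pinhole projection of a camera-frame point (x, y, z), z > 0, to pixel (u, v)."""
    fx, fy, cx, cy = intrinsics
    x, y, z = point
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))

def region_features(image_features, center, radius=1):
    """Collect feature vectors in a (2r+1) x (2r+1) window, clipped to the image."""
    height, width = len(image_features), len(image_features[0])
    u0, v0 = center
    return [
        image_features[v][u]
        for v in range(max(0, v0 - radius), min(height, v0 + radius + 1))
        for u in range(max(0, u0 - radius), min(width, u0 + radius + 1))
    ]

def cross_attend(query, keys):
    """Scaled dot-product attention of one query over keys (keys double as values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * key[c] for w, key in zip(weights, keys)) for c in range(len(query))]

def fuse_voxel(voxel_feature, voxel_center_cam, image_features, intrinsics):
    """Return (fused voxel feature, deviation) for one voxel."""
    center = project_to_pixel(voxel_center_cam, intrinsics)
    region = region_features(image_features, center)
    deviation = cross_attend(voxel_feature, region)  # assumed stand-in for the deviation amount
    fused = [v + d for v, d in zip(voxel_feature, deviation)]
    return fused, deviation

# Hypothetical 3 x 3 image feature map with 2 channels, constant everywhere.
image = [[[1.0, 0.0] for _ in range(3)] for _ in range(3)]
fused, deviation = fuse_voxel(
    voxel_feature=[0.5, 0.5],
    voxel_center_cam=(0.0, 0.0, 2.0),
    image_features=image,
    intrinsics=(1.0, 1.0, 1.0, 1.0),
)
```

In a learned version, the query, keys, and values would pass through trained projections and the sampling locations would themselves be predicted, which is the role the source assigns to the deformable convolutional network.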

[0074] S205. Fuse the dense features and sparse features of the sample image to obtain the sample fused image features.

[0075] S206. Train the target detection model based on the features of the sample fusion image.

[0076] The technical solution of this invention provides a method for specifically determining the dense features and sparse features of a sample image, obtaining the dense features and sparse features of the sample image of the same dimension under the same viewpoint, thus ensuring the accuracy of subsequent sample fusion features.

[0077] Embodiment 3

[0078] Figure 3 is a flowchart of a target detection method provided in Embodiment 3 of the present invention. This embodiment is applicable to target detection in the scene where an autonomous vehicle is located. The method can be executed by a target detection device, which can be implemented in hardware and / or software and can be configured in an electronic device. As shown in Figure 3, the method includes:

[0079] S301. Acquire multi-view images and point cloud data of the target in the target scene where the autonomous vehicle is located.

[0080] Here, "target scene" refers to the actual scene in which the autonomous vehicle is located. "Target multi-view images" refers to images acquired simultaneously from different perspectives within the target scene where the autonomous vehicle is located. "Target point cloud data" refers to point cloud data acquired within the target scene where the autonomous vehicle is located.

[0081] Specifically, multi-view images of targets in the target scene where the autonomous vehicle is located are acquired through image acquisition devices installed on the autonomous vehicle; point cloud data of targets in the target scene where the autonomous vehicle is located are acquired through remote sensing detection devices installed on the autonomous vehicle. The image acquisition devices refer to devices used to collect images of the scene where the autonomous vehicle is located, and can be wide-angle cameras or infrared cameras. It should be noted that the number of image acquisition devices can be preset according to actual business needs; for example, the number of image acquisition devices can be 4 or 6. This embodiment of the invention does not impose a specific limitation. Correspondingly, the installation location of the image acquisition devices on the autonomous vehicle can also be preset according to actual business needs. This embodiment of the invention does not impose a specific limitation, but it must ensure 360-degree full coverage of the scene where the autonomous vehicle is located. The remote sensing detection devices refer to devices used for long-distance detection of the scene where the autonomous vehicle is located, and can be LiDAR.

[0082] S302. A target detection model is used to detect targets in multi-view images and point cloud data to obtain a second predicted target in the target scene; wherein the target detection model is trained based on the training method of the target detection model in any one of claims 1-5.

[0083] The second prediction target refers to the prediction target in the target scenario.

[0084] Specifically, the target multi-view images and target point cloud data are input into the target detection model. The feature extraction network within the model extracts features from the target multi-view images to obtain second image features. These second image features undergo 3D feature transformation to obtain target 3D image features. Finally, convolutional coding is applied to the target 3D image features to obtain target image dense features. Here, the second image features refer to the image features obtained after feature extraction from the target multi-view images. The target 3D image features refer to the pseudo-voxel features of the second image features from a bird's-eye view. The target image dense features refer to the image features obtained after convolutional coding of the target 3D image features.

[0085] Simultaneously, the target point cloud data is voxelized using a voxel transformation algorithm in the target detection model to obtain target voxel features. Based on a multimodal fusion algorithm, the target voxel features are fused with the second image features to obtain fused target voxel features. Convolutional coding is then applied to the fused target voxel features to obtain target image sparse features. Here, target voxel features refer to the voxel features obtained after voxelization of the target point cloud data. Target image sparse features refer to the image features obtained after convolutional coding of the fused target voxel features.

[0086] Subsequently, the target image dense features and target image sparse features are fused through the feature fusion network in the target detection model to obtain the target fusion features from the BEV perspective. Then, target prediction is performed on the target fusion features through the target prediction network in the target detection model to obtain the second predicted target in the target scene.

[0087] The technical solution of this invention performs target detection on the target scene where the autonomous vehicle is located based on a trained target detection model and obtains a second predicted target in the target scene, so that the second predicted target is more accurate.

[0088] Embodiment 4

[0089] Figure 4 is a schematic diagram of a training device for an object detection model provided in Embodiment 4 of the present invention. This embodiment is applicable to optimizing object detection models used in autonomous driving scenarios. The device can be implemented in hardware and / or software and can be configured in an electronic device. As shown in Figure 4, the device includes:

[0090] The image dense feature determination module 401 is used to determine the dense features of the sample image based on the multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device.

[0091] The image sparse feature determination module 402 is used to determine the sparse features of the sample image based on the sample point cloud data in the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment.

[0092] The fused image feature determination module 403 is used to fuse the dense features and sparse features of the sample image to obtain sample fused image features.

[0093] The target detection model training module 404 is used to train the target detection model based on the sample fused image features.

[0094] The technical solution of this invention determines dense features of sample images based on multi-view images of the sample scene where the autonomous vehicle is located, acquired by an image acquisition device; and determines sparse features of sample images based on point cloud data of the sample scene where the autonomous vehicle is located, acquired by a remote sensing device. The dense and sparse features of the sample images are then fused to obtain fused image features. A target detection model is trained based on these fused image features. This technical solution, by fusing dense and sparse features of sample images from the same viewpoint and dimension, solves the problem of feature misalignment in the determination process of fused image features, enhances the feature fusion effect, and results in higher target detection accuracy of the target detection model trained based on the fused image features, thereby improving the accuracy of subsequent target detection results.

[0095] Optionally, the image dense feature determination module 401 is specifically used for:

[0096] Performing feature extraction on the sample multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device, to obtain the first image features;

[0097] Performing 3D feature transformation on the first image features to obtain three-dimensional image features;

[0098] Encoding the three-dimensional image features to obtain the dense features of the sample image.
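
The claims describe the 3D feature transformation as a depth prediction followed by a three-dimensional projection. The sketch below covers only the depth-weighting part for a single pixel: a softmax over hypothetical depth-bin logits weights the pixel's feature vector, giving one feature vector per depth bin. The function names, the softmax, and the number of bins are assumptions, and the projection with the camera extrinsic parameters is omitted.

```python
from math import exp

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(feature, depth_logits):
    """Return a (depth_bins x channels) slice of features for one pixel.

    Each depth bin receives the pixel feature scaled by the predicted
    probability that the pixel lies at that depth.
    """
    weights = softmax(depth_logits)
    return [[w * f for f in feature] for w in weights]

# Hypothetical pixel with 2 channels and 4 equally likely depth bins.
frustum = lift_pixel(feature=[2.0, 4.0], depth_logits=[0.0, 0.0, 0.0, 0.0])
```

With uniform logits each of the four bins receives a quarter of the feature; a confident depth prediction would concentrate the feature in one bin instead.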

[0099] Optionally, the image sparse feature determination module 402 includes:

[0100] The voxel feature determination unit is used to perform voxelization processing on the sample point cloud data of the sample scene where the autonomous vehicle is located, which is collected by the remote sensing equipment, to obtain voxel features.

[0101] The fused voxel feature determination unit is used to fuse the voxel features with the first image features to obtain fused voxel features;

[0102] The image sparse feature determination unit is used to encode the fused voxel features to obtain the sparse features of the sample image.

[0103] Optionally, the fused voxel feature determination unit is specifically used for:

[0104] Determining the region image features corresponding to the voxel features from the first image features;

[0105] Performing a deviation calculation on the region image features and the voxel features to obtain the deviation of the voxel features;

[0106] Determining the fused voxel features based on the voxel features and the deviation.
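
The three steps above can be sketched for a single voxel as follows. The claims name a deformable convolutional network for the deviation calculation and a cross-attention network for the fusion; neither is reproduced here. As simple stand-ins, this sketch assumes the deviation is the difference between the mean region image feature and the voxel feature, and that the fusion is a sigmoid-gated residual. All names are hypothetical.

```python
from math import exp

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fuse_voxel(voxel, region_features):
    """Fuse one voxel feature with its deviation from the region image features.

    deviation = mean(region features) - voxel   (stand-in for a learned offset)
    gate      = sigmoid(voxel . deviation)      (stand-in for attention weighting)
    fused     = voxel + gate * deviation
    """
    region_mean = mean_vector(region_features)
    deviation = [r - v for r, v in zip(region_mean, voxel)]
    score = sum(v * d for v, d in zip(voxel, deviation))
    gate = 1.0 / (1.0 + exp(-score))
    fused = [v + gate * d for v, d in zip(voxel, deviation)]
    return deviation, fused

# Hypothetical voxel feature and two image features from its projection region.
deviation, fused = fuse_voxel(voxel=[1.0, 0.0],
                              region_features=[[1.0, 2.0], [1.0, 4.0]])
```

Here the deviation is `[0.0, 3.0]` and the gate evaluates to 0.5, so the fused feature moves halfway from the voxel feature toward the image evidence.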

[0107] Optionally, the target detection model training module 404 is specifically used for:

[0108] The target prediction network in the target detection model is used to predict the target in the sample fused image features to obtain the first predicted target in the sample scene.

[0109] The target detection model is trained based on the first prediction target and the supervised data in the sample scene.
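
The supervised training described above, in which predictions are compared with labeled data and the model is updated, can be illustrated with a deliberately tiny stand-in. The patent does not state the loss function or optimizer, so the one-dimensional linear model, the mean squared error, and plain gradient descent below are all assumptions; the actual target detection model is a neural network.

```python
def train(features, labels, lr=0.1, epochs=500):
    """Fit prediction = w * feature + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(features, labels):
            error = (w * x + b) - y          # prediction minus supervised label
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical scalar "fused features" with labels generated by y = 2x + 1.
w, b = train(features=[0.0, 1.0, 2.0], labels=[1.0, 3.0, 5.0])
```

After training, `w` and `b` approach 2 and 1, the values that generated the labels. This shows the role of the supervised data, not the scale of the real model.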

[0110] The training apparatus for the target detection model provided in this embodiment of the invention can execute the training method of the target detection model provided in any embodiment of the invention, and has the corresponding functional modules and beneficial effects for executing the training method of each target detection model.

[0111] Embodiment 5

[0112] Figure 5 is a schematic diagram of a target detection device provided in Embodiment 5 of the present invention. This embodiment is applicable to target detection in the scene where an autonomous vehicle is located. The device can be implemented in hardware and/or software and can be configured in an electronic device. As shown in Figure 5, the device includes:

[0113] The data acquisition module 501 is used to acquire target multi-view images and target point cloud data in the target scene where the autonomous vehicle is located;

[0114] The target prediction module 502 is used to perform target detection on the target multi-view images and the target point cloud data using a target detection model to obtain a second predicted target in the target scene; wherein the target detection model is trained based on the training method of the target detection model in any one of claims 1-5.

[0115] The technical solution of this invention performs target detection on the target scene where the autonomous vehicle is located based on a trained target detection model and obtains a second predicted target in the target scene, so that the second predicted target is more accurate.

[0116] The target detection device provided in the embodiments of the present invention can execute the target detection method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing each target detection method.

[0117] Embodiment 6

[0118] Figure 6 shows a schematic diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and/or claimed herein.

[0119] As shown in Figure 6, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

[0120] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0121] Processor 11 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as training methods for object detection models, and / or object detection methods.

[0122] In some embodiments, the object detection model training method, and / or the object detection method, may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded into and / or installed on electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, the object detection model training method described above, and / or one or more steps of the object detection method, may be performed. Alternatively, in other embodiments, processor 11 may be configured to execute the object detection model training method, and / or the object detection method, by any other suitable means (e.g., by means of firmware).

[0123] Various embodiments of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0124] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions / operations specified in the flowcharts and / or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, or as a standalone software package, partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0125] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0126] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0127] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

[0128] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.

[0129] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.

[0130] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A training method for a target detection model, characterized in that the method comprises:
determining dense features of the sample image based on sample multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device;
determining sparse features of the sample image based on sample point cloud data of the sample scene where the autonomous vehicle is located, which are collected by the remote sensing equipment;
fusing the dense features and the sparse features of the sample image to obtain sample fused image features; and
training the target detection model based on the sample fused image features;
wherein the step of determining the dense features of the sample image based on the sample multi-view images of the sample scene where the autonomous vehicle is located, acquired by the image acquisition device, comprises:
performing feature extraction on the sample multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device, to obtain first image features;
performing depth density prediction on the first image features using a depth image classifier to obtain depth image features corresponding to the first image features;
inputting the extrinsic parameters of the image acquisition device and the depth image features into a three-dimensional projection model, and obtaining three-dimensional image features after processing by the three-dimensional projection model; and
performing convolutional coding on the three-dimensional image features to obtain the dense features of the sample image;
and wherein the step of determining the sparse features of the sample image based on the sample point cloud data of the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment, comprises:
voxelizing the sample point cloud data of the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment, to obtain voxel features;
for each voxel feature, extracting from the first image features, based on the projection area of the voxel feature in the camera coordinate system, the image features in the projection area as region image features;
performing, based on a deformable convolutional network, a deviation calculation on the region image features and the voxel features to obtain the deviation amount of the voxel features;
fusing the voxel features and the deviation amount using a cross-attention network to determine fused voxel features; and
encoding the fused voxel features to obtain the sparse features of the sample image.

2. The method according to claim 1, characterized in that the step of training the target detection model based on the sample fused image features comprises: performing target prediction on the sample fused image features using the target prediction network in the target detection model to obtain the first predicted target in the sample scene; and training the target detection model based on the first predicted target and the supervised data in the sample scene.

3. A target detection method, characterized in that the method comprises: acquiring target multi-view images and target point cloud data in the target scene where the autonomous vehicle is located; and performing target detection on the target multi-view images and the target point cloud data using a target detection model to obtain a second predicted target in the target scene; wherein the target detection model is trained based on the training method of the target detection model according to any one of claims 1-2.

4. A training device for a target detection model, characterized in that the device comprises:
an image dense feature determination module, used to determine dense features of the sample image based on sample multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device;
an image sparse feature determination module, used to determine sparse features of the sample image based on sample point cloud data of the sample scene where the autonomous vehicle is located, which are collected by the remote sensing equipment;
a fused image feature determination module, used to fuse the dense features and the sparse features of the sample image to obtain sample fused image features; and
a target detection model training module, used to train the target detection model based on the sample fused image features;
wherein the image dense feature determination module is specifically used for:
performing feature extraction on the sample multi-view images of the sample scene where the autonomous vehicle is located, which are acquired by the image acquisition device, to obtain first image features;
performing depth density prediction on the first image features using a depth image classifier to obtain depth image features corresponding to the first image features;
inputting the extrinsic parameters of the image acquisition device and the depth image features into a three-dimensional projection model, and obtaining three-dimensional image features after processing by the three-dimensional projection model; and
performing convolutional coding on the three-dimensional image features to obtain the dense features of the sample image;
and wherein the image sparse feature determination module includes:
a voxel feature determination unit, used to perform voxelization processing on the sample point cloud data of the sample scene where the autonomous vehicle is located, collected by the remote sensing equipment, to obtain voxel features;
a fused voxel feature determination unit, configured to, for each voxel feature, extract image features in the projection region of the voxel feature in the camera coordinate system from the first image features as region image features; perform a deviation calculation on the region image features and the voxel features based on a deformable convolutional network to obtain the deviation amount of the voxel features; and fuse the voxel features and the deviation amount using a cross-attention network to determine fused voxel features; and
an image sparse feature determination unit, used to encode the fused voxel features to obtain the sparse features of the sample image.

5. A target detection device, characterized in that the device comprises: a data acquisition module, used to acquire target multi-view images and target point cloud data in the target scene where the autonomous vehicle is located; and a target prediction module, used to perform target detection on the target multi-view images and the target point cloud data using a target detection model to obtain a second predicted target in the target scene; wherein the target detection model is trained based on the training method of the target detection model according to any one of claims 1-2.

6. An electronic device, characterized in that the electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, which enables the at least one processor to perform the training method of the target detection model according to any one of claims 1-2, and/or the target detection method according to claim 3.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed, cause a processor to implement the training method of the target detection model according to any one of claims 1-2, and/or the target detection method according to claim 3.