Model training method, vehicle control method, device, equipment, vehicle and medium

By using two-dimensional reference box information to supervise the training of three-dimensional detection box information in the target detection model, the problem that LiDAR cannot capture the three-dimensional information of distant targets is solved, thus improving the detection performance and three-dimensional detection accuracy of the model.

CN122200596A · Pending · Publication Date: 2026-06-12 · ZHEJIANG GEELY HLDG GRP CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG GEELY HLDG GRP CO LTD
Filing Date
2026-03-12
Publication Date
2026-06-12

Abstract

The application discloses a model training method, a vehicle control method, a device, equipment, a vehicle and a medium. The method comprises the following steps: in the process of training an initial model by using a first training sample, for each first training sample, inputting a first image into the initial model to obtain a detection result output by the initial model, the detection result comprising three-dimensional information, the three-dimensional information comprising three-dimensional detection box information predicted for each target object in the first image; determining a model loss value of the initial model according to two-dimensional reference box information and the three-dimensional detection box information of the target object; and adjusting a model parameter of the model according to the model loss value, implementing supervised training on the predicted three-dimensional detection box information by using the information of the two-dimensional reference box, assisting in supervising and constraining the target without three-dimensional reference box information, so that the trained target detection model has more stable detection performance, and the prediction accuracy of the three-dimensional detection information is improved.

Description

Technical Field

[0001] This application belongs to the field of target detection technology, and in particular relates to a model training method, a vehicle control method, a device, equipment, a vehicle, and a medium.

Background Technology

[0002] With the development of in-vehicle intelligent driving systems, target detection has become a core visual perception task of such systems. Detecting targets in front of the vehicle early and perceiving traffic conditions ahead are key to planning routes in advance, avoiding collision risks, and ensuring driving safety.

[0003] In related technologies, the training of target detection models requires the three-dimensional information of the target to complete the training optimization in order to achieve accurate estimation of key parameters such as the target's three-dimensional position and size. The three-dimensional information is mainly obtained by LiDAR perception, and the perception results of LiDAR are an indispensable core data support in the model training process.

[0004] For targets far ahead of the vehicle, LiDAR cannot effectively capture point cloud signals, and therefore cannot perceive the three-dimensional information of distant targets. As a result, during the sample labeling stage of model training, three-dimensional information can be obtained only for targets close to the vehicle. Lacking three-dimensional information for distant targets, the trained model cannot accurately estimate their three-dimensional parameters, which seriously affects the accuracy of target detection.

Summary of the Invention

[0005] This application provides a model training method, vehicle control method, device, equipment, vehicle, and medium. It uses information from two-dimensional reference boxes to supervise the training of predicted three-dimensional detection box information. This can provide auxiliary supervision constraints for target objects without three-dimensional reference box information, enabling the trained target detection model to have more stable detection performance and improving the prediction accuracy of three-dimensional detection information.

[0006] In a first aspect, embodiments of this application provide a model training method, the method comprising: Multiple first training samples are acquired. Each first training sample includes a first image and two-dimensional information corresponding to the first image. The two-dimensional information includes two-dimensional reference box information that surrounds each target object in the first image. The multiple first images include a distant view image in front of the vehicle. During the training of the initial model using the first training samples, for each first training sample, the first image is input into the initial model to obtain the detection result output by the initial model. The detection result includes three-dimensional information, which includes three-dimensional detection box information predicted for each target object in the first image to surround the target object. The model loss value of the initial model is determined based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object. Based on the model loss value, the model parameters of the initial model are adjusted until the initial model meets the preset iteration stopping condition to obtain the target detection model, which is used to perform target detection on the input image.

[0007] In one embodiment of this application, determining the model loss value of the initial model based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object includes: for each target object, determining two-dimensional virtual box information of the target object based on the three-dimensional detection box information of the target object; and obtaining the model loss value based on the two-dimensional virtual box information of the target object and the two-dimensional reference box information of the target object.

[0008] In one embodiment of this application, the detection result further includes, as predicted for each target object in the first image, the coordinates of a key point of the target object, the depth value corresponding to the target object, the orientation angle corresponding to the target object, the two-dimensional detection box information of the target object, and the three-dimensional dimensions of the target object, wherein the key point is the center point of the two-dimensional detection box of the target object in the first image. Determining the two-dimensional virtual box information of the target object based on the three-dimensional detection box information of the target object includes: normalizing the depth value corresponding to the target object to obtain a first depth value corresponding to the target object; determining the three-dimensional coordinates of the center point of the target object in the camera coordinate system based on the first depth value, the coordinates of the key point of the target object, and preset camera intrinsic parameters; determining a set of three-dimensional coordinates of the target object in the camera coordinate system based on the three-dimensional dimensions of the target object, the orientation angle corresponding to the target object, and the three-dimensional coordinates of the center point, wherein the set includes the three-dimensional coordinates of each of eight corner points, and the space enclosed by the eight corner points surrounds the target object; determining, based on the three-dimensional coordinates of the eight corner points and the camera intrinsic parameters, the coordinates of a first projection point for each of the eight corner points projected onto the first image; and obtaining the coordinates of a two-dimensional virtual box based on the coordinates of the first projection points, the coordinates of the two-dimensional virtual box serving as the two-dimensional virtual box information, wherein the two-dimensional virtual box encloses the first projection points of all the corner points.

[0009] In one embodiment of this application, the two-dimensional virtual box information of the target object includes the coordinates of the two-dimensional virtual box of the target object; the two-dimensional reference box information of the target object includes the coordinates of the two-dimensional reference box of the target object and the distance from each edge of the two-dimensional reference box to a second projection point; and the detection result further includes, as predicted for each target object in the first image, the distance from each edge of the two-dimensional detection box surrounding the target object to a third projection point. Obtaining the model loss value based on the two-dimensional virtual box information of the target object and the two-dimensional reference box information of the target object includes: determining a first loss value based on the coordinates of the two-dimensional virtual box of the target object and the coordinates of the two-dimensional reference box of the target object; determining a second loss value based on the distance from each edge of the two-dimensional detection box of the target object to the third projection point and the distance from each edge of the two-dimensional reference box of the target object to the second projection point; and determining the model loss value based on the first loss value and the second loss value.
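
The two loss terms described in [0009] can be sketched as follows. This is only an illustration under stated assumptions: the application does not name a loss function, so a mean absolute (L1) error is assumed for both terms, and the function names, the (x1, y1, x2, y2) box layout, the four-value edge-distance layout, and the weights `w_first` and `w_second` are all hypothetical.

```python
def l1_mean(pred, target):
    """Mean absolute difference between two equal-length sequences (assumed loss form)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def box_supervision_loss(virtual_box, reference_box,
                         pred_edge_dists, ref_edge_dists,
                         w_first=1.0, w_second=1.0):
    """Combine the first and second loss values for one target object.

    virtual_box / reference_box: (x1, y1, x2, y2) in pixels (assumed layout).
    pred_edge_dists: distances from the 2D detection box edges to the third projection point.
    ref_edge_dists: distances from the 2D reference box edges to the second projection point.
    """
    # First loss: projected 3D box (2D virtual box) against the 2D reference box.
    first_loss = l1_mean(virtual_box, reference_box)
    # Second loss: predicted edge distances against reference edge distances.
    second_loss = l1_mean(pred_edge_dists, ref_edge_dists)
    return w_first * first_loss + w_second * second_loss, first_loss, second_loss


total, first, second = box_supervision_loss(
    virtual_box=(100.0, 50.0, 200.0, 150.0),
    reference_box=(102.0, 50.0, 198.0, 154.0),
    pred_edge_dists=(10.0, 20.0, 30.0, 40.0),
    ref_edge_dists=(10.0, 22.0, 30.0, 40.0),
)
```

Because the first term compares the reference box with a box derived from the predicted three-dimensional parameters, it can be evaluated for distant targets that have no three-dimensional annotation.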

[0010] In one embodiment of this application, the detection result further includes the coordinates of a key point of each target object and the target category of each target object, wherein the key point is the center point of the two-dimensional detection box of the target object in the first image; the first training sample further includes the reference category of each target object and the coordinates of a reference point of each target object, wherein the reference point is the center point of the two-dimensional reference box of the target object in the first image. Determining the model loss value based on the first loss value and the second loss value includes: determining a third loss value based on the target category and the reference category of the target object; determining a fourth loss value based on the coordinates of the key point of the target object and the coordinates of the reference point of the target object; and obtaining the model loss value based on the first loss value, the second loss value, the third loss value, and the fourth loss value.
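
As a rough illustration of how the four loss values in [0010] might be combined, the sketch below assumes a cross-entropy term for the category (third loss), an L1 term for the key point (fourth loss), and a weighted sum. None of these choices is stated in the application; the function names and the `weights` parameter are hypothetical.

```python
import math


def category_loss(class_probs, reference_class):
    """Cross-entropy of the predicted class probabilities (assumed form of the third loss)."""
    return -math.log(max(class_probs[reference_class], 1e-12))


def keypoint_loss(key_point, reference_point):
    """L1 distance between predicted key point and reference point (assumed fourth loss)."""
    return abs(key_point[0] - reference_point[0]) + abs(key_point[1] - reference_point[1])


def model_loss(first_loss, second_loss, class_probs, reference_class,
               key_point, reference_point, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss values for one target object."""
    third_loss = category_loss(class_probs, reference_class)
    fourth_loss = keypoint_loss(key_point, reference_point)
    terms = (first_loss, second_loss, third_loss, fourth_loss)
    return sum(w * t for w, t in zip(weights, terms))


loss = model_loss(
    first_loss=2.0, second_loss=0.5,
    class_probs=[0.5, 0.5], reference_class=0,
    key_point=(10.0, 10.0), reference_point=(11.0, 12.0),
)
```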

[0011] In one embodiment of this application, obtaining a plurality of first training samples includes: Acquire first data, which includes multiple original images and two-dimensional reference box information corresponding to the multiple original images. The multiple original images include at least one target object and an original background area. The multiple original images include a distant view image in front of the vehicle and a close-up image in front of the vehicle. Each of the plurality of original images is processed as follows: the original image is preprocessed to obtain a second image, the preprocessing including image cropping and image magnification, the second image including at least one target object and a local background region, the local background region being the distant region in the original background region; the two-dimensional reference box information corresponding to the second image is obtained according to the target transformation matrix and the two-dimensional reference box information corresponding to the original image, the target transformation matrix being determined according to the mapping relationship between the pixel coordinates of the original image and the pixel coordinates of the second image; The first image is formed by combining the original images and the second images. Each of the first images and its corresponding two-dimensional reference box information is used as a training sample.
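
The cropping and magnification step in [0011] can be sketched as a single 3x3 matrix applied to the reference box corners. This is a minimal sketch assuming an axis-aligned crop followed by uniform scaling; the helper names, the (x1, y1, x2, y2) box layout, and the decision to clip boxes to the second image and drop empty ones are assumptions, not details given in the application.

```python
def crop_zoom_matrix(crop_x, crop_y, scale):
    """Homogeneous matrix mapping original pixel coordinates to second-image coordinates.

    A pixel (u, v) maps to (scale * (u - crop_x), scale * (v - crop_y)).
    """
    return [
        [scale, 0.0, -scale * crop_x],
        [0.0, scale, -scale * crop_y],
        [0.0, 0.0, 1.0],
    ]


def transform_box(matrix, box, out_width, out_height):
    """Map a 2D reference box into the second image; return None if it falls outside."""
    def apply(u, v):
        x = matrix[0][0] * u + matrix[0][1] * v + matrix[0][2]
        y = matrix[1][0] * u + matrix[1][1] * v + matrix[1][2]
        return x, y

    x1, y1 = apply(box[0], box[1])
    x2, y2 = apply(box[2], box[3])
    # Clip to the second image (assumed handling of partially visible objects).
    x1, x2 = max(0.0, min(x1, out_width)), max(0.0, min(x2, out_width))
    y1, y2 = max(0.0, min(y1, out_height)), max(0.0, min(y2, out_height))
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


# Crop a 400x200 distant region starting at (400, 200) and magnify it by 2.
matrix = crop_zoom_matrix(crop_x=400, crop_y=200, scale=2.0)
far_box = transform_box(matrix, (450, 220, 500, 260), out_width=800, out_height=400)
near_box = transform_box(matrix, (0, 0, 100, 100), out_width=800, out_height=400)
```

Here `far_box` lands inside the magnified second image, while `near_box` lies entirely outside the cropped region and is dropped.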

[0012] In a second aspect, embodiments of this application provide a vehicle control method applied to an electronic device, the method comprising: acquiring an image to be detected collected by a vehicle; performing target detection on the image to be detected using a target detection model to obtain an output result of the target detection model, wherein the output result contains target three-dimensional information, the target three-dimensional information includes three-dimensional detection box information predicted for each object in the image to be detected to surround that object, and the target detection model is obtained by the model training method described in the first aspect; and controlling the vehicle to move based on the three-dimensional detection box information of the object.

[0013] In a third aspect, embodiments of this application provide a model training apparatus, the apparatus comprising: a first acquisition module, used to acquire multiple first training samples, wherein each first training sample includes a first image and two-dimensional information corresponding to the first image, the two-dimensional information includes two-dimensional reference box information that surrounds each target object in the first image, and the multiple first images include a distant view image in front of the vehicle; a training module, used to input, for each first training sample during the training of the initial model using the first training samples, the first image into the initial model and obtain the detection result output by the initial model, wherein the detection result includes three-dimensional information, which includes three-dimensional detection box information predicted for each target object in the first image to surround the target object; a determination module, used to determine the model loss value of the initial model based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object; and a processing module, used to adjust the model parameters of the initial model according to the model loss value until the initial model meets the preset iteration stopping condition to obtain the target detection model, which is used to perform target detection on the input image.

[0014] In a fourth aspect, embodiments of this application provide a vehicle control device, the device comprising: a second acquisition module, used to acquire the image to be detected collected by the vehicle; a detection module, used to perform target detection on the image to be detected using a target detection model and to obtain the output result of the target detection model, wherein the output result contains target three-dimensional information, which includes three-dimensional detection box information predicted for each object in the image to be detected to surround that object, and the target detection model is obtained by the model training method described in the first aspect; and a control module, used to control the movement of the vehicle based on the three-dimensional detection box information of the object.

[0015] In a fifth aspect, embodiments of this application provide an electronic device, including a processor and a memory storing computer program instructions, wherein the processor, when executing the computer program instructions, implements the model training method described in the first aspect or the vehicle control method described in the second aspect.

[0016] In a sixth aspect, embodiments of this application provide a vehicle including the electronic device described in the fifth aspect.

[0017] In a seventh aspect, embodiments of this application provide a computer-readable storage medium storing computer program instructions, which, when executed by a processor, implement the model training method described in the first aspect or the vehicle control method described in the second aspect.

[0018] In an eighth aspect, embodiments of this application provide a computer program product, wherein instructions in the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform the model training method described in the first aspect or the vehicle control method described in the second aspect.

[0019] This application provides a model training method, vehicle control method, device, equipment, vehicle, and medium. Multiple first training samples are acquired, each first training sample including a first image and two-dimensional information corresponding to the first image. The two-dimensional information includes two-dimensional reference box information that surrounds each target object in the first image. The multiple first images include a distant view image in front of the vehicle. During the training of the initial model using the first training samples, for each first training sample, the first image is input into the initial model to obtain the detection result output by the initial model. The detection result includes three-dimensional information, which includes three-dimensional detection box information predicted for each target object in the first image to surround the target object. The model loss value of the initial model is determined based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object. Based on the model loss value, the model parameters of the initial model are adjusted until the initial model meets a preset iteration stopping condition, thereby obtaining the target detection model, which is used to perform target detection on the input image. In these steps, the two-dimensional reference box information is used to supervise the training of the predicted three-dimensional detection box information. This provides auxiliary supervision constraints for target objects without three-dimensional reference box information, so that the trained target detection model has more stable detection performance and the prediction accuracy of the three-dimensional detection information is improved.

Brief Description of the Drawings

[0020] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0021] Figure 1 is a schematic flowchart of the model training method provided in an embodiment of this application; Figure 2 is a schematic diagram of the output results of the target detection model provided in an embodiment of this application; Figure 3 is a schematic diagram showing an image before and after preprocessing provided in an embodiment of this application; Figure 4 is a schematic flowchart of the vehicle control method provided in an embodiment of this application; Figure 5 is a schematic diagram of the structure of the model training device provided in an embodiment of this application; Figure 6 is a schematic diagram of the structure of the vehicle control device provided in an embodiment of this application; Figure 7 is a schematic diagram of the structure of the electronic device provided in an embodiment of this application.

Detailed Description

[0022] The features and exemplary embodiments of various aspects of this application will be described in detail below. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is merely to provide a better understanding of this application by illustrating examples.

[0023] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.

[0024] In all specific embodiments of this application, when processing data related to user identity or characteristics, such as user information, user behavior data, user historical data, and user location information, user permission or consent is obtained first. Furthermore, the collection, use, and processing of this data comply with relevant laws, regulations, and standards. Additionally, when embodiments of this application require access to sensitive personal information, separate permission or consent from the user is obtained through pop-ups or redirects to confirmation pages. Only after obtaining the user's separate permission or consent is the necessary user-related data required for the proper functioning of these embodiments obtained.

[0025] To address the problems of the prior art, embodiments of this application provide a model training method, a vehicle control method, an apparatus, a device, a vehicle, and a medium. The model training method provided in this application embodiment will be described first below.

[0026] Figure 1 shows a schematic flowchart of a model training method provided in one embodiment of this application. As shown in Figure 1, the model training method provided in this embodiment is applied to an electronic device and includes the following steps 101-104.

Step 101: Obtain multiple first training samples. Each first training sample includes a first image and two-dimensional information corresponding to the first image. The two-dimensional information includes two-dimensional reference box information that surrounds each target object in the first image. The multiple first images include a distant view image in front of the vehicle.

[0027] In this embodiment, multiple first training samples are obtained. Each first training sample includes a first image and two-dimensional information corresponding to the first image. The two-dimensional information includes two-dimensional reference box information that surrounds each target object in the first image. The two-dimensional reference box has four sides and four corner points. The two-dimensional reference box information includes the distance from each side of the two-dimensional reference box to the projection point, and/or the coordinates of the four corner points of the two-dimensional reference box.

[0028] The first image may be captured by a monocular camera installed in the vehicle. The multiple first images include distant-view images of the area in front of the vehicle and/or close-range images of the area in front of the vehicle. The images need not come from a single vehicle; they may be captured by the same vehicle or by multiple vehicles. The first image is a two-dimensional image.

[0029] Step 102: During the training of the initial model using the first training samples, for each first training sample, the first image is input into the initial model to obtain the detection result output by the initial model. The detection result includes three-dimensional information, which includes three-dimensional detection box information predicted for each target object in the first image to surround the target object.

[0030] In this embodiment, multiple first training samples are used to train the initial model. During the training of the initial model using the first training samples, for each first training sample, the first image of that sample is input into the initial model to obtain the detection result output by the initial model. The detection result includes three-dimensional information, which includes three-dimensional detection box information predicted for each target object in the first image to surround the target object.

[0031] Referring to Figure 2, the detection result output by the initial model also includes two-dimensional information, which includes two-dimensional detection box information predicted for each target object in the first image. The detection result further includes the three-dimensional dimensions of the target object (length, width, and height), the depth value corresponding to the target object, the orientation angle corresponding to the target object, as well as the key point and category of the target object. The key point is the center point of the two-dimensional detection box of the target object in the first image. The three-dimensional detection box information of the target object is determined based on the two-dimensional detection box information of the target object, the depth value corresponding to the target object, the orientation angle corresponding to the target object, and the three-dimensional dimensions of the target object. In other words, the three-dimensional detection box information of the target object can be described by the above information.

[0032] The initial model obtains detection results based on the outputs of multiple feature layers. For each item of information to be predicted, the spatial output size is the same as the size of the feature layer that predicts it. For a given feature layer of size (feat_h, feat_w), the output size of the key points predicted by that feature layer is (feat_h, feat_w), and the output size of the class prediction is (N_c, feat_h, feat_w), where N_c is the number of classes to be predicted. The output size of the two-dimensional detection box information is (4, feat_h, feat_w), and similarly, the output size of the three-dimensional detection box information is (N_3, feat_h, feat_w), where N_3 is the number of three-dimensional information items to be output; for example, N_3 = 3 for the length, width, and height. The initial model may be the FCOS3D model or another model used for target detection.
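
To make the output sizes in [0032] concrete, the helper below lists the shape of each prediction for one feature layer. It is a bookkeeping sketch only: the function name, the dictionary keys, and the example feature-layer size are hypothetical, while the shapes themselves follow the paragraph above.

```python
def head_output_shapes(feat_h, feat_w, n_classes, n_3d):
    """Return the output shape of each prediction for one feature layer."""
    return {
        "keypoint": (feat_h, feat_w),          # one key-point score per location
        "class": (n_classes, feat_h, feat_w),  # N_c class channels
        "box_2d": (4, feat_h, feat_w),         # four 2D box values per location
        "box_3d": (n_3d, feat_h, feat_w),      # N_3 three-dimensional items
    }


shapes = head_output_shapes(feat_h=48, feat_w=160, n_classes=3, n_3d=3)
```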

[0033] A screening threshold, namely a preset confidence level, is set for the initial model. Each candidate object detected in the first image has a confidence level, which is compared with the preset confidence level. If the confidence level of a candidate object is greater than the preset confidence level, the candidate object is identified as a target object, and the detection result containing the three-dimensional detection box information of the target object is further obtained. If the confidence level of a candidate object is less than or equal to the preset confidence level, the candidate object is not a target object. The confidence level here is the category confidence: each predicted category is accompanied by a confidence value between 0 and 1. The closer the value is to 1, the higher the probability, as judged by the model, that the target belongs to that category, and the more reliable the result.
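
The screening rule in [0033] amounts to a strict comparison against the preset confidence level. The sketch below assumes each candidate is a dictionary with hypothetical `category` and `confidence` fields; the threshold value 0.5 is chosen only for the example.

```python
def select_targets(candidates, min_confidence):
    """Keep candidates whose category confidence is strictly greater than the threshold."""
    return [c for c in candidates if c["confidence"] > min_confidence]


candidates = [
    {"category": "car", "confidence": 0.92},
    {"category": "pedestrian", "confidence": 0.30},
    {"category": "truck", "confidence": 0.50},  # equal to the threshold, so rejected
]
targets = select_targets(candidates, min_confidence=0.5)
```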

[0034] Step 103: Determine the model loss value of the initial model based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object.

[0035] In this embodiment, the model loss value of the initial model is determined based on the two-dimensional reference box information of each target object and the corresponding three-dimensional detection box information.

[0036] Step 104: Adjust the model parameters of the initial model according to the model loss value until the initial model meets the preset iteration stopping condition to obtain the target detection model, which is used to perform target detection on the input image.

[0037] In this embodiment, the model parameters of the initial model are adjusted according to the model loss value until the initial model meets the preset iteration stopping condition to obtain the target detection model, which is used to perform target detection on the input image.

[0038] Optionally, the iteration stopping condition of the model can be preset. For example, a number of iterations may be preset, and the initial model is determined to meet the preset iteration stopping condition when the number of iterations reaches the preset number; alternatively, a loss value may be preset, and the initial model is determined to meet the preset iteration stopping condition when the model loss value of the initial model reaches the preset loss value.
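
The two optional stopping conditions in [0038] can be sketched as a single check. This is a sketch under assumptions: "reaches the preset loss value" is interpreted here as the loss falling to or below the preset value, and the function and parameter names are hypothetical.

```python
def should_stop(iteration, loss, max_iterations=None, target_loss=None):
    """Return True when either preset stopping condition is met."""
    # Condition 1: the number of completed iterations reaches the preset number.
    if max_iterations is not None and iteration >= max_iterations:
        return True
    # Condition 2: the model loss value reaches the preset loss value
    # (interpreted as loss <= target_loss, which is an assumption).
    if target_loss is not None and loss <= target_loss:
        return True
    return False


stop_by_count = should_stop(iteration=1000, loss=0.8, max_iterations=1000)
stop_by_loss = should_stop(iteration=10, loss=0.04, target_loss=0.05)
keep_training = should_stop(iteration=10, loss=0.8, max_iterations=1000, target_loss=0.05)
```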

[0039] In this embodiment, the information of the two-dimensional reference box is used to supervise the training of the predicted three-dimensional detection box information. This can provide auxiliary supervision and constraints for target objects without three-dimensional reference box information, so that the trained target detection model has more stable detection performance and improves the prediction accuracy of the three-dimensional detection information.

[0040] In one embodiment of this application, determining the model loss value of the initial model based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object includes: for each target object, determining two-dimensional virtual box information of the target object based on the three-dimensional detection box information of the target object; and obtaining the model loss value based on the two-dimensional virtual box information of the target object and the two-dimensional reference box information of the target object.

[0041] In this embodiment, for each target object, the two-dimensional virtual box information of the target object is determined based on the three-dimensional detection box information of the target object. In other words, the two-dimensional virtual box information of the target object is obtained by converting the three-dimensional detection box information. The two-dimensional virtual box is not output directly by the initial model, but is derived indirectly from the detection results of the initial model.

[0042] Furthermore, the model loss value of the initial model is obtained based on the two-dimensional virtual box information and the two-dimensional reference box information of the target object.

[0043] Supervised training of predicted 3D detection boxes using 2D reference box information provides auxiliary supervisory constraints for target objects lacking 3D reference box information. This results in a more stable detection performance for the trained target detection model and improves the prediction accuracy of 3D detection information. It supplements effective spatial constraint information for target objects lacking 3D annotations, compensating for the insufficiency of 3D labeled data and enabling the model to establish more accurate 2D-3D feature associations during the learning process. Simultaneously, the hard constraints of the 2D reference boxes effectively suppress the divergence problem in 3D parameter prediction, reducing prediction errors for unlabeled targets, thereby improving the overall detection robustness and 3D parameter regression accuracy of the model.

[0044] In one embodiment of this application, the detection result further includes, for each target object in the first image, the predicted coordinates of the key point of the target object, the depth value corresponding to the target object, the orientation angle corresponding to the target object, the two-dimensional detection box information of the target object, and the three-dimensional dimensions of the target object, wherein the key point is the center point of the two-dimensional detection box of the target object in the first image. Determining the two-dimensional virtual bounding box information of the target object based on the three-dimensional detection bounding box information of the target object includes: normalizing the depth value corresponding to the target object to obtain the first depth value corresponding to the target object; determining the three-dimensional coordinates of the center point of the target object in the camera coordinate system based on the first depth value corresponding to the target object, the coordinates of the key point of the target object, and the preset camera intrinsic parameters; determining a set of three-dimensional coordinates of the target object in the camera coordinate system based on the three-dimensional dimensions of the target object, the orientation angle corresponding to the target object, and the three-dimensional coordinates of the center point of the target object, wherein the set of three-dimensional coordinates includes the three-dimensional coordinates of each of the eight corner points, and the space formed by the eight corner points surrounds the target object; determining, based on the three-dimensional coordinates of the eight corner points and the camera intrinsic parameters, the coordinates of the first projection point of each of the eight corner points projected onto the first image; and obtaining the coordinates of the two-dimensional virtual bounding box based on the coordinates of the first projection point corresponding to each corner point, and using the coordinates of the two-dimensional virtual bounding box as the two-dimensional virtual bounding box information, wherein the two-dimensional virtual bounding box encloses the first projection points corresponding to all corner points.

[0045] In this embodiment, the detection results of the initial model also include the coordinates of the key points of the target object predicted for each target object in the first image, the depth value corresponding to the target object, the orientation angle corresponding to the target object, the two-dimensional detection box information of the target object, and the three-dimensional dimensions of the target object (including length, width and height). The two-dimensional detection box information includes the distance of each side of the two-dimensional detection box relative to the projection point, and / or the coordinates of the four points of the two-dimensional detection box.

[0046] The depth value corresponding to the target object is normalized to obtain the first depth value corresponding to the target object. Specifically:

$$Z = \hat{z} \cdot \frac{f}{c} \tag{1}$$

where $Z$ is the first depth value corresponding to the target object, i.e., the actual depth value; $\hat{z}$ is the depth value corresponding to the target object, i.e., the value predicted by the initial model; $f$ is the average focal length; and $c$ is a constant, usually set to 1000.

[0047] The average focal length is calculated as follows:

$$f = \frac{f_x + f_y}{2} \tag{2}$$

where $f$ is the average focal length, $f_x$ is the focal length of the camera along the x-axis, and $f_y$ is the focal length of the camera along the y-axis.
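As a non-limiting illustration, the depth recovery described above can be sketched as follows. The sketch assumes that the scaling takes the form "actual depth = predicted depth × f / c", with f the average of the two focal lengths; the function names `average_focal_length` and `recover_depth` are hypothetical and do not appear in this application.

```python
def average_focal_length(fx: float, fy: float) -> float:
    """Average of the focal lengths along the x-axis and the y-axis."""
    return (fx + fy) / 2.0


def recover_depth(pred_depth: float, fx: float, fy: float, c: float = 1000.0) -> float:
    """Convert the depth predicted by the model into the first (actual) depth value.

    Assumption: actual depth = predicted depth * f / c, where c is the
    constant that is usually set to 1000.
    """
    f = average_focal_length(fx, fy)
    return pred_depth * f / c


# Example: a camera with fx = 1900 and fy = 2100 has an average focal length of 2000,
# so a predicted (normalized) depth of 30 corresponds to an actual depth of 60.
print(recover_depth(30.0, 1900.0, 2100.0))
```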

[0048] Furthermore, based on the first depth value corresponding to the target object, the coordinates of the key point of the target object, and the preset camera intrinsic parameters, the three-dimensional coordinates of the center point of the target object in the camera coordinate system are determined. The key point of the target object is the center point of the two-dimensional detection box of the target object in the first image. Specifically:

$$(X, Y, Z)^{T} = Z \cdot K^{-1} \cdot (u, v, 1)^{T} \tag{3}$$

where $(X, Y, Z)$ represents the three-dimensional coordinates of the center point of the target object, $T$ denotes the transpose, $K^{-1}$ is the inverse of the preset camera intrinsic parameter matrix $K$, $(u, v)$ are the coordinates of the key point of the target object, and $Z$ is the first depth value.
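As a non-limiting illustration, the back-projection of the key point into the camera coordinate system can be sketched as follows. The sketch assumes a pinhole intrinsic matrix without skew, described by focal lengths `fx`, `fy` and principal point `cx`, `cy`, so that multiplying by the inverse intrinsic matrix reduces to the closed form below; the function name `backproject_center` is hypothetical.

```python
def backproject_center(u: float, v: float, depth: float,
                       fx: float, fy: float, cx: float, cy: float):
    """Back-project the key point (u, v) with the first depth value into 3D.

    Assumption: K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (no skew), so
    depth * K^-1 * (u, v, 1)^T = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# A key point 200 px right of and 100 px below the principal point, at a depth of 50
center = backproject_center(1160.0, 640.0, 50.0, 1000.0, 1000.0, 960.0, 540.0)
print(center)
```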

[0049] Furthermore, the three-dimensional dimensions of the target object, the orientation angle of the target object, and the three-dimensional coordinates of the center point of the target object are used to determine the three-dimensional coordinate set of the target object in the camera coordinate system. The three-dimensional coordinate set includes the three-dimensional coordinates of each of the eight corner points. The space formed by the eight corner points is used to surround the target object. The three-dimensional detection box is formed by the eight corner points.
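As a non-limiting illustration, the eight corner points can be generated from the center point, the three-dimensional dimensions, and the orientation angle as sketched below. The axis conventions are assumptions that this application does not specify: the camera y-axis is taken as vertical, the orientation angle is a rotation about that axis, the length lies along the local x-axis, the width along the local z-axis, and the center point is the geometric center of the box. The function name `box_corners` is hypothetical.

```python
import math


def box_corners(center, dims, yaw):
    """Return the eight corner points of a 3D box in the camera coordinate system.

    center: (x, y, z) of the box center; dims: (length, width, height);
    yaw: orientation angle in radians, assumed to be a rotation about the y-axis.
    """
    cx, cy, cz = center
    length, width, height = dims
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx in (1, -1):
        for sy in (1, -1):
            for sz in (1, -1):
                # Corner offset in the local frame of the object
                lx, ly, lz = sx * length / 2, sy * height / 2, sz * width / 2
                # Rotate about the y-axis, then translate to the center point
                x = cos_y * lx + sin_y * lz + cx
                y = ly + cy
                z = -sin_y * lx + cos_y * lz + cz
                corners.append((x, y, z))
    return corners


corners = box_corners((0.0, 0.0, 20.0), (4.0, 2.0, 1.5), 0.0)
print(len(corners))
```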

[0050] Furthermore, based on the three-dimensional coordinates of each of the eight corner points and the camera intrinsic parameter matrix K, the coordinates of the first projection point of each corner point on the first image are obtained. Specifically, each corner point is projected onto the first image using its three-dimensional coordinates and the camera intrinsic parameters, and this is repeated until the coordinates of the first projection points of all eight corner points have been calculated.
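As a non-limiting illustration, the projection of a single corner point onto the first image can be sketched with a standard pinhole model. The sketch assumes an intrinsic matrix without skew and without lens distortion; the function name `project_point` is hypothetical.

```python
def project_point(point, fx: float, fy: float, cx: float, cy: float):
    """Project a 3D point in the camera coordinate system onto the image plane.

    Assumption: pinhole model, K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    x, y, z = point
    if z <= 0:
        # A point on or behind the image plane has no valid projection
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


print(project_point((10.0, 5.0, 50.0), 1000.0, 1000.0, 960.0, 540.0))
```

Applying the same function to each of the eight corner points yields the eight first projection points.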

[0051] Based on the coordinates of the first projection point corresponding to each corner point, the coordinates of the two-dimensional virtual box are obtained. Specifically, the extreme values of all projected corner points along the X and Y axes are taken to form an axis-aligned bounding box. The coordinates of the two-dimensional virtual box surrounding the eight corner points are as follows:

$$x_{\min} = \min_{i} u_i; \quad y_{\min} = \min_{i} v_i; \quad x_{\max} = \max_{i} u_i; \quad y_{\max} = \max_{i} v_i; \quad i = 1, \dots, 8$$

where $(u_i, v_i)$ are the coordinates of the first projection point of the $i$-th corner point. Converting 3D bounding box information into 2D virtual bounding box information allows low-cost 2D annotations to compensate for scarce 3D annotations.
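As a non-limiting illustration, the axis-aligned two-dimensional virtual box can be computed from the projected corner points as sketched below. The function name `virtual_box` and the output ordering (x_min, y_min, x_max, y_max) are assumptions made for this sketch.

```python
def virtual_box(projected_points):
    """Axis-aligned box enclosing the projected corner points.

    projected_points: iterable of (u, v) image coordinates.
    Returns (x_min, y_min, x_max, y_max).
    """
    us = [p[0] for p in projected_points]
    vs = [p[1] for p in projected_points]
    return (min(us), min(vs), max(us), max(vs))


# Eight projected corner points of a hypothetical target object
points = [(100.0, 220.0), (180.0, 215.0), (95.0, 300.0), (185.0, 310.0),
          (110.0, 225.0), (170.0, 222.0), (108.0, 290.0), (172.0, 295.0)]
print(virtual_box(points))
```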

[0052] In one embodiment of this application, the two-dimensional virtual bounding box information of the target object includes the coordinates of the two-dimensional virtual bounding box of the target object; the two-dimensional reference box information of the target object includes the coordinates of the two-dimensional reference box of the target object and the distance from each edge of the two-dimensional reference box to the second projection point; the detection result also includes the distance from each edge of the two-dimensional detection box used to surround the target object to the third projection point predicted for each target object in the first image. The step of obtaining the model loss value based on the two-dimensional virtual bounding box information and the two-dimensional reference bounding box information of the target object includes: A first loss value is determined based on the coordinates of the two-dimensional virtual bounding box of the target object and the coordinates of the two-dimensional reference bounding box of the target object; The second loss value is determined based on the distance from each edge of the two-dimensional detection box of the target object to the third projection point and the distance from each edge of the two-dimensional reference box of the target object to the second projection point. The model loss value is determined based on the first loss value and the second loss value.

[0053] In this embodiment, the two-dimensional virtual bounding box information of the target object includes the coordinates of the two-dimensional virtual bounding box of the target object; the two-dimensional reference bounding box information of the target object includes the coordinates of the two-dimensional reference bounding box of the target object, and the distance from each edge of the two-dimensional reference bounding box to the second projection point; the detection result output by the initial model also includes the two-dimensional detection box information predicted for each target object in the first image to surround the target object, and the distance from each edge of the two-dimensional detection box predicted for each target object in the first image to the third projection point.

[0054] For the same target object in each first image, a first value is obtained by calculating the loss between the coordinates of the target object's 2D virtual bounding box and the coordinates of the target object's 2D reference bounding box. The average of the first values corresponding to all target objects in each first image is then calculated to obtain the first loss value, which can be considered a pseudo-3D loss. For the same target object in each first image, a second value is obtained by calculating the loss between the distance from each edge of the target object's 2D detection bounding box to the third projection point and the distance from each edge of the target object's 2D reference bounding box to the second projection point. The average of the second values corresponding to all target objects in each first image is then calculated to obtain the second loss value.

[0055] The model loss value can be calculated by adding the first loss value and the second loss value.
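As a non-limiting illustration, the first loss value, the second loss value, and their sum can be sketched as follows. This application does not specify the form of the per-object loss; a mean absolute error (L1) is assumed here purely for illustration, and the dictionary keys and function names are hypothetical.

```python
def l1(pred, ref):
    """Mean absolute error between two equally sized coordinate tuples (assumed loss form)."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def model_loss(objects):
    """Sum of the first loss (virtual box vs. reference box) and the second loss (edge distances).

    Each object is a dict with the hypothetical keys 'virtual_box', 'ref_box',
    'det_dists' and 'ref_dists'; each loss is averaged over all target objects in the image.
    """
    first = sum(l1(o["virtual_box"], o["ref_box"]) for o in objects) / len(objects)
    second = sum(l1(o["det_dists"], o["ref_dists"]) for o in objects) / len(objects)
    return first + second


objects = [{
    "virtual_box": (0.0, 0.0, 10.0, 10.0),  # obtained from the predicted 3D detection box
    "ref_box": (1.0, 0.0, 10.0, 12.0),      # annotated 2D reference box
    "det_dists": (5.0, 5.0, 5.0, 5.0),      # predicted edge distances
    "ref_dists": (5.0, 5.0, 6.0, 5.0),      # reference edge distances
}]
print(model_loss(objects))
```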

[0056] The 3D detection bounding box information is converted into 2D virtual bounding box information for further loss calculation, reducing the threshold for model training and deployment. This assists the model to converge quickly, reducing the difficulty of training and optimizing 3D detection models.

[0057] In one embodiment of this application, the detection result further includes the coordinates of key points of each target object and the target category of each target object, wherein the key point is the center point of the two-dimensional detection box of the target object in the first image; the first training sample further includes the reference category of each target object and the coordinates of the reference point of each target object, wherein the reference point is the center point of the two-dimensional reference box of the target object in the first image; Determining the model loss value based on the first loss value and the second loss value includes: A third loss value is determined based on the target category and the reference category of the target object; The fourth loss value is determined by calculating the coordinates of the key points of the target object and the coordinates of the reference point of the target object. The model loss value is obtained based on the first loss value, the second loss value, the third loss value, and the fourth loss value.

[0058] In this embodiment, for the same target object in each first image, a loss is calculated based on the target object's target category and reference category to obtain a third value. The average of the third values corresponding to all target objects in each first image is then calculated to obtain a third loss value. For the same target object in each first image, a loss is calculated based on the coordinates of the target object's key point and the coordinates of the target object's reference point to obtain a fourth value. The average of the fourth values corresponding to all target objects in each first image is then calculated to obtain a fourth loss value.

[0059] Here, the reference point is the center point of the two-dimensional reference box of the target object in the first image; the key point is the center point of the two-dimensional detection box of the target object in the first image; and the category refers to the determination result of "what" each detected object is.

[0060] The model loss value is calculated as follows: the first loss value, the second loss value, the third loss value and the fourth loss value are added together to obtain the model loss value.
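As a non-limiting illustration, averaging each per-object value over the target objects of an image and then adding the four resulting loss values can be sketched as follows. The per-object values are taken as given inputs, since this application does not specify how each one is computed; the function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)


def total_model_loss(per_object_values):
    """Combine per-object values into the model loss value.

    per_object_values: list of (first, second, third, fourth) tuples, one per target object.
    Each column is averaged over the objects to give one loss value,
    and the four loss values are then added together.
    """
    return sum(mean(column) for column in zip(*per_object_values))


# Two target objects in one first image
print(total_model_loss([(1.0, 2.0, 3.0, 4.0), (3.0, 2.0, 1.0, 0.0)]))
```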

[0061] First, calculate the loss of each prediction information, and then calculate the overall model loss. This prevents a key parameter from being masked by the losses of other parameters, ensuring that each parameter can be effectively optimized. Calculating the loss of each parameter separately and then summing them up results in clearer gradient signals, smoother model convergence, and reduces the likelihood of the entire training process collapsing due to excessive error in a single parameter.

[0062] In one embodiment of this application, obtaining a plurality of first training samples includes: Acquire first data, which includes multiple original images and two-dimensional reference box information corresponding to the multiple original images. The multiple original images include at least one target object and an original background area. The multiple original images include a distant view image in front of the vehicle and a close-up image in front of the vehicle. Each of the plurality of original images is processed as follows: the original image is preprocessed to obtain a second image, the preprocessing including image cropping and image magnification, the second image including at least one target object and a local background region, the local background region being the distant region in the original background region; the two-dimensional reference box information corresponding to the second image is obtained according to the target transformation matrix and the two-dimensional reference box information corresponding to the original image, the target transformation matrix being determined according to the mapping relationship between the pixel coordinates of the original image and the pixel coordinates of the second image; The first image is formed by combining the original images and the second images. Each of the first images and its corresponding two-dimensional reference box information is used as a training sample.

[0063] In this embodiment, first data is obtained, which can be obtained from the video captured by the vehicle. The first data includes multiple original images and two-dimensional reference box information corresponding to each original image. Each original image includes at least one target object and an original background area. The multiple original images include a distant view image in front of the vehicle and a close-up image in front of the vehicle.

[0064] Furthermore, each of the multiple original images is processed as follows: the original image is preprocessed, including image cropping and image magnification, to obtain a second image. The second image includes at least one target object and a local background region, wherein the local background region is the distant region in the original background region; the two-dimensional reference box information corresponding to the second image is obtained based on the target transformation matrix and the two-dimensional reference box information corresponding to the original image.

[0065] The target transformation matrix is determined based on the mapping relationship between the pixel coordinates of the original image and the pixel coordinates of the second image. The two-dimensional reference box is generally annotated on the original image of the forward-looking camera. When the central region of the image is cropped and enlarged, the coordinates of the two-dimensional reference box change. Therefore, the target transformation matrix needs to be calculated first, and is then used to obtain the two-dimensional reference box information corresponding to the second image.
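As a non-limiting illustration, the target transformation matrix for a crop followed by a uniform enlargement can be sketched as a 3×3 homogeneous matrix and applied to a two-dimensional reference box as shown below. The sketch assumes that the crop starts at pixel (x0, y0) of the original image and that the cropped region is enlarged by a single factor `scale`; the function names are hypothetical.

```python
def crop_scale_matrix(x0: float, y0: float, scale: float):
    """Homogeneous matrix mapping original-image pixels to second-image pixels.

    Assumption: the second image is the region starting at (x0, y0),
    enlarged uniformly by `scale`, so (x, y) maps to (scale * (x - x0), scale * (y - y0)).
    """
    return [[scale, 0.0, -scale * x0],
            [0.0, scale, -scale * y0],
            [0.0, 0.0, 1.0]]


def transform_box(matrix, box):
    """Apply the matrix to a box given as (x_min, y_min, x_max, y_max)."""
    def apply(x, y):
        return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
                matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])
    x1, y1 = apply(box[0], box[1])
    x2, y2 = apply(box[2], box[3])
    return (x1, y1, x2, y2)


# Crop the central region starting at (480, 270) and enlarge it by a factor of 2
m = crop_scale_matrix(480.0, 270.0, 2.0)
print(transform_box(m, (500.0, 300.0, 600.0, 400.0)))
```

In practice, a transformed box that extends beyond the second image would also need to be clipped or discarded; that handling is omitted from the sketch.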

[0066] Multiple original images and multiple second images are used as the first image, and each first image and its corresponding two-dimensional reference box information are used as a training sample.

[0067] Optionally, the original image is preprocessed to obtain a second image, including: Identify the central region of the original image, which includes the target object; The image is cropped based on the central region of the original image to obtain the cropped image. The cropped image is enlarged to obtain a second image.

[0068] Referring to Figure 3, the image on the left is the original image. The central region of the original image is identified, and this central region includes the target object, which can be a pedestrian, vehicle, animal, etc. The central region of the original image is then cropped to obtain a cropped image. This cropped image is then enlarged to obtain a second image. One original image can yield at least one second image. As shown in Figure 3, two second images are obtained from one original image.

[0069] It should be noted that different vehicle models, different cameras, and different preprocessing methods result in images with different focal lengths. After the above processing, the size of the same target object is inconsistent across different images. According to the principle of perspective, objects appear larger when closer and smaller when farther away, so different sizes should correspond to different depths. Therefore, the depth value corresponding to the target object is normalized for different camera focal lengths so that it again conforms to the principle of perspective. The following formula is used to normalize the true depth value of the target object, yielding a reference depth value:

$$z_{\mathrm{ref}} = Z_{\mathrm{gt}} \cdot \frac{c}{f} \tag{4}$$

where $z_{\mathrm{ref}}$ is the reference depth value for the target object; $Z_{\mathrm{gt}}$ is the actual depth of the target object; $f$ is the average focal length; and $c$ is a constant, usually set to 1000, so that the reference depth value and the actual depth are within the same order of magnitude.
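As a non-limiting illustration, the normalization of the true depth value can be sketched as follows. The sketch assumes that the normalization takes the form "reference depth = actual depth × c / f", so that a longer effective focal length, which makes the target object appear larger, yields a smaller reference depth; the function name `normalize_depth` is hypothetical.

```python
def normalize_depth(actual_depth: float, avg_focal_length: float, c: float = 1000.0) -> float:
    """Normalize the true depth of a target object to a reference depth value.

    Assumption: reference depth = actual depth * c / f, with c usually set to 1000.
    """
    return actual_depth * c / avg_focal_length


# The same target object at an actual depth of 60, seen in the original image (f = 1000)
# and in a second image enlarged by a factor of 2 (effective f = 2000)
print(normalize_depth(60.0, 1000.0))
print(normalize_depth(60.0, 2000.0))
```

Under this assumption, the target object that appears twice as large in the enlarged second image receives half the reference depth, which is consistent with the principle of perspective described above.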

[0070] Optionally, the first data also includes three-dimensional information corresponding to the original images. The three-dimensional information includes three-dimensional reference box information that surrounds each target object in the original image. Since the original images include distant view images and close-up images in front of the vehicle, only a portion of the original images in the first data have corresponding three-dimensional information. These original images are close-up images in front of the vehicle. During image acquisition, the vehicle can also acquire the three-dimensional information of the target object through the vehicle's sensors. Furthermore, by combining this with the two-dimensional reference box information, the three-dimensional reference box information is obtained.

[0071] The original image containing 3D information can also be used as the first image. During the training process using the first image containing 3D information, the first image is input into the initial model to obtain the detection results output by the initial model. The detection results include the 3D detection box information predicted for each target object in the first image to surround the target object. Based on the 3D reference box information and the corresponding 3D detection box information of each target object, the model loss value of the initial model is determined. Based on the model loss value, the model parameters of the initial model are adjusted until the initial model meets the preset iteration stopping condition to obtain the target detection model.

[0072] The model loss is calculated differently for a first image that contains both 3D and 2D information and for a first image that contains only 2D information. Specifically, for a target object in a first image containing both 3D and 2D information, the reference point of the target object is the projection of the center point of the 3D reference box onto the first image, and the key point of the target object is the projection of the center point of the 3D detection box onto the first image. For a target object in a first image containing only 2D reference box information, the reference point of the target object is the center point of the 2D reference box in the first image, and the key point of the target object is the center point of the 2D detection box of the target object in the first image. Generally speaking, for distant targets, the center point of the 2D reference box and the projected center point nearly coincide, so the model adapts well during training.

[0073] Cropping the central region of the image as training data amplifies the feature information of far-field targets, and changes in far-field targets are more noticeable in the image, which is more conducive to the detection of far-field targets. This approach requires no additional hardware costs, relying solely on pure images for far-field target detection. Simultaneously detecting both 2D and 3D target information in a single model, with multi-task networks complementing each other, significantly improves the target detection rate and generalization, while reducing post-processing complexity. Using 2D bounding box ground truth to supervise 3D prediction information provides auxiliary supervision for targets without 3D ground truth. Compared to relying solely on the model's generalization ability to detect targets without 3D ground truth, the model's detection is more stable, and the 3D information is more accurate. For targets without 3D ground truth, the model's prediction of the target's orientation angle, based on the poses of other targets with 3D ground truth in the image, has a certain degree of generalization. The target's size information, based on the target's accurate category information, also contributes to the model's generalization ability. Furthermore, normalizing the target's depth ground truth allows the model to retain the near-larger, farther-smaller depth estimation pattern during the learning process, reducing the learning difficulty.

[0074] Figure 4 shows a schematic flowchart of a vehicle control method according to an embodiment of this application. As shown in Figure 4, the vehicle control method provided in this application embodiment is applied to an electronic device and includes the following steps 401-403, wherein: Step 401: Acquire the image to be detected collected by the vehicle.

[0075] In this embodiment, the execution entity is an electronic device, which is installed in the vehicle and acquires the image to be detected collected by the vehicle.

[0076] Step 402: Use an object detection model to perform object detection on the image to be detected, and obtain the output result of the object detection model. The output result contains three-dimensional object information, which includes three-dimensional detection box information predicted for each object in the image to surround the object. The object detection model is obtained based on the model training method described above.

[0077] In this embodiment, the target detection model performs target detection on the image and obtains the output result of the target detection model. The output result includes target 3D information, which includes 3D detection box information predicted for each object in the image to surround the object. The target detection model is obtained based on the model training method.

[0078] Step 403: Control the vehicle to drive based on the three-dimensional detection box information of the object.

[0079] In this embodiment, the three-dimensional detection box information includes the three-dimensional dimensions of the object (including length, width, and height) and the object's coordinate information. Based on the object's three-dimensional detection box information, the vehicle is controlled to drive. Specifically, after an object is detected, the core is to complete the closed-loop action of perception-decision-control, from target recognition and parameter parsing at the perception layer, to risk assessment and behavior planning at the decision layer, and then to vehicle execution operation at the control layer.

[0080] The results output by the object detection model also include the depth value of the object, the orientation angle of the object, key points, and category. Key points are the center points of the two-dimensional detection boxes of the object in the image, and the category refers to what kind of object the detected object is.

[0081] It can accurately identify targets in front of the vehicle, respond to targets in real time, and take timely actions such as obstacle avoidance and speed adjustment to reduce the risk of collision; making vehicle driving control more intelligent and adaptive, and adaptable to complex and ever-changing traffic scenarios.

[0082] Figure 5 shows a structural diagram of the model training apparatus provided in an embodiment of this application. As shown in Figure 5, the model training apparatus 500 includes: a first acquisition module 501, used to acquire multiple first training samples, wherein each first training sample includes a first image and two-dimensional information corresponding to the first image, the two-dimensional information includes two-dimensional reference box information that surrounds each target object in the first image, and the multiple first images include a distant view image in front of the vehicle; a training module 502, used to input the first image into the initial model for each first training sample during the training of the initial model using the first training samples, and obtain the detection result output by the initial model, wherein the detection result includes three-dimensional information, and the three-dimensional information includes three-dimensional detection box information predicted for each target object in the first image to surround the target object; a determining module 503, used to determine the model loss value of the initial model based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object; and a processing module 504, used to adjust the model parameters of the initial model according to the model loss value until the initial model meets the preset iteration stopping condition to obtain the target detection model, which is used to perform target detection on the input image.

[0083] In one embodiment of this application, the determining module 503 is further configured to, for each target object, determine the two-dimensional virtual bounding box information of the target object based on the three-dimensional detection bounding box information of the target object; and obtain the model loss value based on the two-dimensional virtual bounding box information of the target object and the two-dimensional reference bounding box information of the target object.

[0084] In one embodiment of this application, the determining module 503 is further configured to normalize the depth value corresponding to the target object to obtain a first depth value corresponding to the target object; determine the three-dimensional coordinates of the center point of the target object in the camera coordinate system based on the first depth value corresponding to the target object, the coordinates of the key points of the target object, and preset camera intrinsic parameters; determine the three-dimensional coordinate set of the target object in the camera coordinate system based on the three-dimensional size of the target object, the orientation angle corresponding to the target object, and the three-dimensional coordinates of the center point of the target object, the three-dimensional coordinate set including the three-dimensional coordinates of each of the eight corner points, the space formed by the eight corner points being used to surround the target object; determine the coordinates of the first projection point of each of the eight corner points projected onto the first image based on the three-dimensional coordinates of the eight corner points and the camera intrinsic parameters; obtain the coordinates of a two-dimensional virtual frame based on the first projection point coordinates corresponding to each corner point, and use the coordinates of the two-dimensional virtual frame as the two-dimensional virtual frame information, the two-dimensional virtual frame being used to surround the first projection point coordinates corresponding to each corner point.

[0085] In one embodiment of this application, the determining module 503 is further configured to determine a first loss value based on the coordinates of the two-dimensional virtual bounding box of the target object and the coordinates of the two-dimensional reference bounding box of the target object; determine a second loss value based on the distance from each edge of the two-dimensional detection box of the target object to the third projection point and the distance from each edge of the two-dimensional reference bounding box of the target object to the second projection point; and determine the model loss value based on the first loss value and the second loss value.

[0086] In one embodiment of this application, the determining module 503 is further configured to determine a third loss value based on the target category and the reference category of the target object; calculate and determine a fourth loss value based on the coordinates of the key points of the target object and the coordinates of the reference points of the target object; and obtain the model loss value based on the first loss value, the second loss value, the third loss value, and the fourth loss value.

[0087] In one embodiment of this application, the acquisition module is further configured to acquire first data, the first data including multiple original images and two-dimensional reference box information corresponding to the multiple original images, the multiple original images including at least one target object and an original background region, the multiple original images including a distant view image in front of the vehicle and a close-up image in front of the vehicle; performing the following processing on each of the multiple original images: preprocessing the original image to obtain a second image, the preprocessing including image cropping and image magnification, the second image including at least one target object and a local background region, the local background region being a distant region in the original background region; obtaining two-dimensional reference box information corresponding to the second image according to a target transformation matrix and the two-dimensional reference box information corresponding to the original image, the target transformation matrix being determined according to the mapping relationship between the pixel coordinates of the original image and the pixel coordinates of the second image; using the multiple original images and the multiple second images as the first image; using each first image and the two-dimensional reference box information corresponding to the first image as a training sample.

[0088] The model training apparatus provided in this application embodiment can implement all the processes implemented in the aforementioned model training method embodiment and achieve the same technical effect. To avoid repetition, it will not be described again here.

[0089] Figure 6 shows a structural diagram of the vehicle control device provided in an embodiment of this application. As shown in Figure 6, the vehicle control device 600 includes: a second acquisition module 601, used to acquire the image to be detected collected by the vehicle; a detection module 602, used to perform target detection on the image to be detected using a target detection model, and obtain the output result of the target detection model, wherein the output result contains target three-dimensional information, the target three-dimensional information includes three-dimensional detection box information predicted for each object in the image to surround the object, and the target detection model is obtained based on the model training method according to any one of claims 1 to 6; and a control module 603, used to control the vehicle to drive based on the three-dimensional detection box information of the object.

[0090] The vehicle control device provided in this application embodiment can realize the various processes implemented in the aforementioned vehicle control method embodiment and achieve the same technical effect. To avoid repetition, it will not be described again here.

[0091] Figure 7 shows a schematic diagram of the hardware structure of the electronic device provided in an embodiment of this application.

[0092] The electronic device may include a processor 701 and a memory 702 storing computer program instructions.

[0093] Specifically, the processor 701 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.

[0094] Memory 702 may include mass storage for data or instructions. By way of example and not limitation, memory 702 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 702 may include removable or non-removable (or fixed) media. Where appropriate, memory 702 may be internal or external to the electronic device. In a particular embodiment, memory 702 is non-volatile solid-state memory.

[0095] Memory may include read-only memory (ROM), random access memory (RAM), disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical / tangible memory storage devices. Therefore, typically, memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the method according to the first aspect of this disclosure.

[0096] The processor 701 implements any of the methods described above in the above embodiments by reading and executing computer program instructions stored in the memory 702.

[0097] In one example, the electronic device may also include a communication interface 703 and a bus 710. As shown in Figure 7, the processor 701, the memory 702, and the communication interface 703 are connected through the bus 710 and communicate with each other through it.

[0098] The communication interface 703 is mainly used to realize communication between various modules, devices, units and / or equipment in the embodiments of this application.

[0099] Bus 710 includes hardware, software, or both, that couples the components of the electronic device to one another. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, bus 710 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, any suitable bus or interconnect is contemplated herein.

[0100] In addition, this application provides a vehicle that includes the aforementioned electronic device.

[0101] Alternatively, embodiments of this application can be implemented using a computer storage medium. This computer storage medium stores computer program instructions; when these computer program instructions are executed by a processor, they implement any of the model training methods described in the above embodiments.

[0102] Alternatively, this application embodiment can provide a computer program product for implementation, wherein the instructions in the computer program product, when executed by the processor of an electronic device, cause the electronic device to implement any of the model training methods in the above embodiments.

[0103] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described as examples. However, the method process of this application is not limited to the specific steps described. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.

[0104] The functional blocks shown in the above-described structural diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. Programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave. "Machine-readable medium" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet, intranets, etc.

[0105] It should also be noted that the exemplary embodiments mentioned in this application describe methods or systems based on a series of steps or apparatus. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.

[0106] The aspects of this disclosure have been described above with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block in the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that these instructions, executable via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions / actions specified in one or more blocks of the flowchart illustrations and / or block diagrams. Such a processor can be, but is not limited to, a general-purpose processor, a special-purpose processor, a special application processor, or a field-programmable logic circuit. It is also understood that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can also be implemented by special-purpose hardware performing the specified functions or actions, or can be implemented by a combination of special-purpose hardware and computer instructions.

[0107] The above description is merely a specific implementation of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, modules, and units described above correspond to the processes in the foregoing method embodiments and are not repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the protection scope of this application.

Claims

1. A model training method, characterized in that, The method includes: Multiple first training samples are acquired. Each first training sample includes a first image and two-dimensional information corresponding to the first image. The two-dimensional information includes two-dimensional reference box information that surrounds each target object in the first image. The multiple first images include a distant view image in front of the vehicle. During the training of the initial model using the first training samples, for each first training sample, the first image is input into the initial model to obtain the detection result output by the initial model. The detection result includes three-dimensional information, which includes three-dimensional detection box information predicted for each target object in the first image to surround the target object. The model loss value of the initial model is determined based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object. Based on the model loss value, the model parameters of the initial model are adjusted until the initial model meets the preset iteration stopping condition to obtain the target detection model, which is used to perform target detection on the input image.

2. The model training method according to claim 1, characterized in that the step of determining the model loss value of the initial model based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object includes: for each target object, determining two-dimensional virtual box information of the target object based on the three-dimensional detection box information of the target object; and obtaining the model loss value based on the two-dimensional virtual box information of the target object and the two-dimensional reference box information of the target object.

3. The model training method according to claim 2, characterized in that the detection result also includes, as predicted for each target object in the first image, the coordinates of the key point of the target object, the depth value corresponding to the target object, the orientation angle corresponding to the target object, the two-dimensional detection box information of the target object, and the three-dimensional dimensions of the target object, wherein the key point is the center point of the two-dimensional detection box of the target object in the first image; and determining the two-dimensional virtual box information of the target object based on the three-dimensional detection box information of the target object includes: normalizing the depth value corresponding to the target object to obtain a first depth value corresponding to the target object; determining, based on the first depth value corresponding to the target object, the coordinates of the key point of the target object, and preset camera intrinsic parameters, the three-dimensional coordinates of the center point of the target object in the camera coordinate system; determining, based on the three-dimensional dimensions of the target object, the orientation angle corresponding to the target object, and the three-dimensional coordinates of the center point of the target object, a set of three-dimensional coordinates of the target object in the camera coordinate system, the set of three-dimensional coordinates including the three-dimensional coordinates of each of eight corner points, and the space formed by the eight corner points being used to surround the target object; determining, based on the three-dimensional coordinates of the eight corner points and the camera intrinsic parameters, the coordinates of the first projection point of each of the eight corner points projected onto the first image; and obtaining, based on the coordinates of the first projection point corresponding to each corner point, the coordinates of the two-dimensional virtual box, and using the coordinates of the two-dimensional virtual box as the two-dimensional virtual box information, the two-dimensional virtual box being used to enclose the first projection points corresponding to the corner points.

4. The model training method according to claim 2, characterized in that the two-dimensional virtual box information of the target object includes the coordinates of the two-dimensional virtual box of the target object; the two-dimensional reference box information of the target object includes the coordinates of the two-dimensional reference box of the target object and the distance from each edge of the two-dimensional reference box to the second projection point; the detection result also includes, as predicted for each target object in the first image, the distance from each edge of the two-dimensional detection box used to surround the target object to the third projection point; and the step of obtaining the model loss value based on the two-dimensional virtual box information and the two-dimensional reference box information of the target object includes: determining a first loss value based on the coordinates of the two-dimensional virtual box of the target object and the coordinates of the two-dimensional reference box of the target object; determining a second loss value based on the distance from each edge of the two-dimensional detection box of the target object to the third projection point and the distance from each edge of the two-dimensional reference box of the target object to the second projection point; and determining the model loss value based on the first loss value and the second loss value.

5. The model training method according to claim 4, characterized in that the detection result also includes the coordinates of the key point of each target object and the target category of each target object, wherein the key point is the center point of the two-dimensional detection box of the target object in the first image; the first training sample also includes the reference category of each target object and the coordinates of the reference point of each target object, wherein the reference point is the center point of the two-dimensional reference box of the target object in the first image; and determining the model loss value based on the first loss value and the second loss value includes: determining a third loss value based on the target category and the reference category of the target object; determining a fourth loss value based on the coordinates of the key point of the target object and the coordinates of the reference point of the target object; and obtaining the model loss value based on the first loss value, the second loss value, the third loss value, and the fourth loss value.

6. The model training method according to any one of claims 1 to 5, characterized in that, The acquisition of multiple first training samples includes: Acquire first data, which includes multiple original images and two-dimensional reference box information corresponding to the multiple original images. The multiple original images include at least one target object and an original background area. The multiple original images include a distant view image in front of the vehicle and a close-up image in front of the vehicle. Each of the plurality of original images is processed as follows: the original image is preprocessed to obtain a second image, the preprocessing including image cropping and image magnification, the second image including at least one target object and a local background region, the local background region being the distant region in the original background region; the two-dimensional reference box information corresponding to the second image is obtained according to the target transformation matrix and the two-dimensional reference box information corresponding to the original image, the target transformation matrix being determined according to the mapping relationship between the pixel coordinates of the original image and the pixel coordinates of the second image; The first image is formed by combining the original images and the second images. Each of the first images and its corresponding two-dimensional reference box information is used as a training sample.

7. A vehicle control method, characterized in that the method is applied to an electronic device and includes: acquiring an image to be detected collected by the vehicle; performing target detection on the image to be detected using a target detection model to obtain the output result of the target detection model, wherein the output result contains target three-dimensional information, the target three-dimensional information includes three-dimensional detection box information predicted for each object in the image to be detected to surround the object, and the target detection model is obtained based on the model training method according to any one of claims 1 to 6; and controlling the vehicle to move based on the three-dimensional detection box information of the object.

8. A model training device, characterized in that, The device includes: The first acquisition module is used to acquire multiple first training samples. Each first training sample includes a first image and two-dimensional information corresponding to the first image. The two-dimensional information includes two-dimensional reference box information that surrounds each target object in the first image. The multiple first images include a distant view image in front of the vehicle. The training module is used to input the first image into the initial model for each of the first training samples during the training of the initial model using the first training samples, and obtain the detection result output by the initial model. The detection result includes three-dimensional information, which includes three-dimensional detection box information predicted for each target object in the first image to surround the target object. The determination module is used to determine the model loss value of the initial model based on the two-dimensional reference box information and the corresponding three-dimensional detection box information of each target object; The processing module is used to adjust the model parameters of the initial model according to the model loss value until the initial model meets the preset iteration stopping condition to obtain the target detection model, which is used to perform target detection on the input image.

9. A vehicle control device, characterized in that the device includes: a second acquisition module, used to acquire the image to be detected collected by the vehicle; a detection module, used to perform target detection on the image to be detected using a target detection model and to obtain the output result of the target detection model, wherein the output result contains target three-dimensional information, the target three-dimensional information includes three-dimensional detection box information predicted for each object in the image to be detected to surround the object, and the target detection model is obtained based on the model training method according to any one of claims 1 to 6; and a control module, used to control the vehicle to move based on the three-dimensional detection box information of the object.

10. An electronic device, characterized in that it includes: a processor and a memory storing computer program instructions; wherein the processor, when executing the computer program instructions, implements the model training method according to any one of claims 1 to 6, or the vehicle control method according to claim 7.

11. A vehicle, characterized in that it includes the electronic device according to claim 10.

12. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer program instructions, which, when executed by a processor, implement the model training method as described in any one of claims 1-6, or the vehicle control method as described in claim 7.
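
For illustration only, the projection recited in claims 2 and 3 and the first loss value of claim 4 can be sketched as follows. The sketch assumes a pinhole camera with intrinsic parameters fx, fy, cx, cy, an orientation angle measured about the camera's vertical (Y) axis, a depth value that has already been converted to metric depth, and an L1 distance as the first loss value. The application does not specify the normalization, the rotation convention, or the form of the loss, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]
Box = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max), pixels


def back_project_center(u: float, v: float, depth: float,
                        fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift the key point (u, v) to a 3D center in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def box_corners(center: Point3, dims: Tuple[float, float, float],
                yaw: float) -> List[Point3]:
    """Return the eight corner points of the 3D box.

    Assumption: dims = (length, width, height), with length along camera X
    and width along camera Z when yaw is 0, and height along camera Y.
    """
    length, width, height = dims
    xc, yc, zc = center
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2.0, length / 2.0):
        for dy in (-height / 2.0, height / 2.0):
            for dz in (-width / 2.0, width / 2.0):
                # Rotate the offset about the camera Y axis, then translate.
                x = cos_y * dx + sin_y * dz
                z = -sin_y * dx + cos_y * dz
                corners.append((xc + x, yc + dy, zc + z))
    return corners


def project(p: Point3, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection; assumes the point is in front of the camera (z > 0)."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def virtual_box_2d(corners: List[Point3],
                   fx: float, fy: float, cx: float, cy: float) -> Box:
    """Smallest axis-aligned 2D box enclosing the eight projected corners."""
    pts = [project(c, fx, fy, cx, cy) for c in corners]
    us = [p[0] for p in pts]
    vs = [p[1] for p in pts]
    return (min(us), min(vs), max(us), max(vs))


def l1_box_loss(virtual_box: Box, reference_box: Box) -> float:
    """Mean absolute difference of the four box coordinates (assumed loss form)."""
    return sum(abs(a - b) for a, b in zip(virtual_box, reference_box)) / 4.0


if __name__ == "__main__":
    fx = fy = 1000.0
    cx, cy = 640.0, 360.0
    center = back_project_center(640.0, 360.0, 50.0, fx, fy, cx, cy)
    corners = box_corners(center, (4.0, 2.0, 2.0), 0.0)
    virtual = virtual_box_2d(corners, fx, fy, cx, cy)
    reference = (600.0, 340.0, 680.0, 380.0)
    print(virtual, l1_box_loss(virtual, reference))
```

Because the two-dimensional virtual box is computed from the predicted depth, dimensions, and orientation angle, a loss against the two-dimensional reference box constrains those three-dimensional quantities even when no three-dimensional reference box is available for a distant target.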