An image processing method, apparatus, storage medium, and computer program product
By using a polar coordinate neural network model and an adaptive attention mechanism, images in different 2D coordinate systems are transformed into a 3D scene representation from a bird's-eye view (BEV) perspective. This addresses the large errors of 2D-to-3D image transformation in autonomous driving and yields an accurate 3D scene representation that supports multiple tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2022-07-01
- Publication Date
- 2026-06-12
Figure CN115273002B_ABST
Description
Technical Field
[0001] This application relates to the field of image processing, and more particularly to an image processing method, apparatus, storage medium, and computer program product.
Background Technology
[0002] Visual detection has been widely used in many fields. However, developing visual detection for autonomous driving scenarios is much more difficult than in other artificial intelligence (AI) fields. One of the main reasons is that the input two-dimensional (2D) images must be converted into a three-dimensional (3D) space.
[0003] Existing visual detection methods typically ignore the coordinate system of the input image (e.g., image classification tasks) or perform predictions in the same coordinate system as the input image (e.g., image segmentation and object detection tasks). However, in autonomous driving scenarios, with the increasing number of onboard cameras, a key challenge for achieving further performance improvements in autonomous driving systems is how to transform multiple input images in different 2D coordinate systems into a vehicle-centric 3D space for subsequent downstream tasks such as 3D object detection or lane segmentation.
[0004] Current methods for converting 2D images to 3D space all have significant errors, making it difficult to generate a unified, accurate, and dense 3D scene representation from images acquired in different 2D coordinate systems.
Summary of the Invention
[0005] In view of this, an image processing method, apparatus, storage medium, and computer program product are proposed.
[0006] In a first aspect, embodiments of this application provide an image processing method, the method comprising: acquiring a two-dimensional image acquired by a first image acquisition device; the first image acquisition device being any image acquisition device installed on a vehicle; extracting features from the two-dimensional image using a neural network model, and determining features corresponding to at least one scene point from the extracted features; wherein the scene point is a preset scene point in a preset scene point set under a bird's-eye view (BEV) perspective, the preset scene point set being distributed in a polar coordinate system with the vehicle as the pole, and the plane containing the preset scene point set being parallel to the ground; the neural network model being trained using training data corresponding to a target task; and executing the target task based on the features corresponding to the at least one scene point.
[0007] Based on the above technical solution, a unified 3D scene near the vehicle from the BEV perspective is modeled in polar coordinates, which is more consistent with the pinhole camera model. Features of the 2D image are extracted through a neural network model, and features corresponding to at least one scene point are determined from the extracted 2D image features. This allows the features of the 2D image required for the preset scene points distributed in polar coordinates to be obtained in reverse. This transforms 2D images from different coordinate systems into a unified, accurate, and dense 3D scene representation from the BEV perspective, avoiding the error accumulation caused by depth estimation and the suboptimal results caused by the lack of geometric constraints in implicit projection. Furthermore, based on the features corresponding to at least one scene point, a target task can be executed. In some examples, there can be multiple target tasks, thus enabling the unified, accurate, and dense 3D scene representation to be applied to multiple target tasks simultaneously.
[0008] According to the first aspect, in a first possible implementation of the first aspect, the at least one scene point includes preset scene points located on the same ray in the preset scene point set, the ray having the pole as its endpoint; the step of extracting features of the two-dimensional image through a neural network model and determining features corresponding to the at least one scene point from the extracted features includes: extracting features of the two-dimensional image through the neural network model and determining features corresponding to the at least one scene point from the extracted features based on an attention mechanism.
[0009] Based on the above technical solution, considering that the probability of an object appearing at a certain angle is relatively high, that is, the probability of a preset scene point on the same ray corresponding to the features of the same object is relatively high, an adaptive attention mechanism is applied to the preset scene points on the same ray. That is, by using an adaptive attention mechanism to constrain the preset scene points on the same ray, the relationship between the preset scene points on the same ray is calculated, thereby better suppressing erroneous 3D scene information, more accurately determining the features of the 2D image corresponding to the preset scene point, and helping to make the obtained 3D scene expression more accurate.
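The idea of relating preset scene points that lie on the same ray can be illustrated with a small sketch. It uses plain scaled dot-product self-attention as a stand-in; the source does not specify the form of its adaptive attention mechanism, and the function names and the absence of learned projection weights are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ray_self_attention(ray_features):
    """Scaled dot-product self-attention over the feature vectors of one ray.

    ray_features: list of equal-length feature vectors, one per scene point
    on the ray. No learned query/key/value projections are used (assumption).
    """
    dim = len(ray_features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in ray_features:
        # Similarity of this scene point to every scene point on the same ray.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in ray_features]
        weights = softmax(scores)
        # Each output is a weighted mix of the features along the ray.
        outputs.append([sum(w * value[d] for w, value in zip(weights, ray_features))
                        for d in range(dim)])
    return outputs

# Three scene points on one ray, each with a 2-dimensional feature.
mixed = ray_self_attention([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

Each output vector is a convex combination of the features on the ray, so a point whose feature disagrees with its neighbours is pulled toward them, which is one way to read the "suppressing erroneous 3D scene information" effect described above.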
[0010] According to the first aspect or the first possible implementation of the first aspect, in the second possible implementation of the first aspect, the step of extracting features from the two-dimensional image through a neural network model and determining features corresponding to at least one scene point from the extracted features includes: extracting features from the two-dimensional image through the neural network model to obtain an image feature set; wherein the image feature set includes features corresponding to multiple positions on the two-dimensional image; determining the three-dimensional coordinates corresponding to the at least one scene point through the neural network model; mapping the three-dimensional coordinates to the coordinate system of the image acquisition device according to the three-dimensional coordinates and the calibration information of the first image acquisition device, and determining the target position corresponding to the three-dimensional coordinates among the multiple positions; and obtaining the features corresponding to the at least one scene point according to the features corresponding to the target position in the image feature set.
[0011] Based on the above technical solution, by utilizing the one-to-one projection relationship determined by the predefined preset scene points and the calibration information of the image acquisition device, the 2D semantic information on the two-dimensional images acquired by the image acquisition devices located in different coordinate systems is filled into the preset scene points, thereby realizing the transformation of 2D images in different coordinate systems into a unified, accurate, and dense 3D scene expression under the BEV perspective.
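As a rough illustration of the projection step, the sketch below maps a 3D point in the vehicle frame to a pixel using a pinhole model and then reads the feature stored at that position. The calibration layout (a rotation matrix, a translation vector, and the intrinsics `fx`, `fy`, `cx`, `cy`), the camera-frame convention (z pointing forward), the feature-map `stride`, and the nearest-neighbour lookup are all assumptions; the source only states that calibration information is used to find the target position.

```python
def project_to_image(point_xyz, rotation, translation, fx, fy, cx, cy):
    """Project a vehicle-frame 3D point to pixel coordinates (u, v).

    rotation (3x3) and translation (3) are assumed to map vehicle
    coordinates to camera coordinates. Returns None if the point lies
    behind the camera.
    """
    cam = [sum(rotation[r][c] * point_xyz[c] for c in range(3)) + translation[r]
           for r in range(3)]
    if cam[2] <= 0.0:
        return None
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return (u, v)

def sample_feature(feature_map, uv, stride):
    """Nearest-neighbour lookup in feature_map[row][col]; None if off-image."""
    if uv is None:
        return None
    col = int(uv[0] // stride)
    row = int(uv[1] // stride)
    if 0 <= row < len(feature_map) and 0 <= col < len(feature_map[0]):
        return feature_map[row][col]
    return None

# Toy setup: camera frame equals vehicle frame, 10 x 20 feature map, stride 8.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
feature_map = [[(r, c) for c in range(20)] for r in range(10)]
uv = project_to_image((1.0, 0.5, 2.0), identity, [0.0, 0.0, 0.0],
                      fx=100.0, fy=100.0, cx=50.0, cy=40.0)
feature = sample_feature(feature_map, uv, stride=8)
```

Because the scene points are fixed in advance, this lookup runs from 3D to 2D, so no per-pixel depth has to be predicted.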
[0012] According to the second possible implementation of the first aspect, in the third possible implementation of the first aspect, obtaining the features corresponding to the at least one scene point based on the features corresponding to the target position in the image feature set includes: based on the features corresponding to the target position in the image feature set, repeatedly executing the determination of the three-dimensional coordinates and subsequent operations corresponding to the at least one scene point based on an attention mechanism until a preset number of loops is reached; obtaining the features corresponding to the at least one scene point based on the features corresponding to the target position when the preset number of loops is reached.
[0013] Based on the above technical solution, by utilizing the one-to-one projection relationship determined by the predefined preset scene points and the calibration information of the image acquisition device, the preset scene points are accurately projected onto the specific positions of the two-dimensional image. At the same time, based on the adaptive attention mechanism of polar coordinates, after multiple layers of iterative encoding (i.e., after a preset number of loops), the 2D semantic information corresponding to the preset scene points is accurately obtained. The 2D semantic information on the two-dimensional images acquired by the image acquisition devices located in different coordinate systems is filled into the preset scene points, thereby realizing the transformation of 2D images in different coordinate systems into a unified, accurate, and dense 3D scene expression from the BEV perspective.
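The loop structure described above can be sketched as follows. The two callables are hypothetical placeholders: `sample_fn` stands for "determine 3D coordinates, project them, and gather image features", and `attend_fn` stands for the attention step over points on the same ray. Neither name comes from the source.

```python
def refine_scene_features(initial_features, sample_fn, attend_fn, num_loops):
    """Repeat the sample-then-attend step a preset number of times.

    sample_fn and attend_fn are hypothetical callables supplied by the
    caller; this function only fixes the order and count of iterations.
    """
    features = initial_features
    for _ in range(num_loops):
        sampled = sample_fn(features)   # gather 2D image features for the scene points
        features = attend_fn(sampled)   # relate scene points, e.g. along the same ray
    return features

# Toy callables so the control flow can be traced by hand.
result = refine_scene_features(
    [0],
    sample_fn=lambda f: [x + 1 for x in f],
    attend_fn=lambda f: [2 * x for x in f],
    num_loops=3,
)
```

The features returned after the last iteration are the ones used for the scene points.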
[0014] According to the first aspect or the various possible implementations of the first aspect described above, in the fourth possible implementation of the first aspect, the preset scene points in the preset scene point set are evenly distributed in the polar coordinate system.
[0015] Based on the above technical solution, by predefining preset scene points uniformly distributed in polar coordinates, compared with depth estimation and implicit projection, the performance loss caused by pixel-level depth prediction and inconsistent projection relationships can be avoided, which helps to obtain a more accurate 3D scene expression.
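A minimal sketch of a uniformly distributed preset scene point set in polar coordinates, with the vehicle at the pole, is given below. The grid sizes, the maximum radius, and the choice to exclude the pole itself are illustrative assumptions rather than values from the source.

```python
import math

def make_polar_scene_points(num_rays, num_radial, max_radius):
    """Return a list of rays; each ray is a list of (radius, angle) points.

    Angles are evenly spaced over a full turn and radii are evenly spaced
    up to max_radius, with the pole (radius 0) excluded.
    """
    rays = []
    for i in range(num_rays):
        angle = 2.0 * math.pi * i / num_rays
        ray = []
        for j in range(num_radial):
            radius = max_radius * (j + 1) / num_radial
            ray.append((radius, angle))
        rays.append(ray)
    return rays

# 8 rays with 4 scene points each, covering up to 40 units from the vehicle.
points = make_polar_scene_points(num_rays=8, num_radial=4, max_radius=40.0)
```

Grouping the points by ray keeps the points that share an angle together, which is convenient when an attention step is applied along each ray.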
[0016] According to the first aspect or the various possible implementations of the first aspect described above, in the fifth possible implementation of the first aspect, the method further includes: acquiring training data corresponding to the target task; the training data includes two-dimensional sample images acquired by at least one image acquisition device of the vehicle; and using the training data and the preset scene point set to train a preset model to obtain the neural network model.
[0017] Based on the above technical solution, by pre-defining preset scene points distributed in polar coordinates to learn the 3D scene representation of the vehicle, the semantic information obtained by the trained neural network model is more accurate, and an accurate 3D scene representation can be learned without a depth prediction network. In addition, the trained neural network model can transform multiple 2D images into a unified, accurate, and dense 3D scene representation from the BEV perspective. This solves the error and sparsity problems in 3D scene representation that may be caused by depth estimation and implicit projection methods, and the generated 3D scene representation can be used simultaneously for multiple downstream autonomous driving tasks such as 3D object detection and BEV semantic segmentation.
[0018] According to the fifth possible implementation of the first aspect, in the sixth possible implementation of the first aspect, the step of training the preset model using the training data and the preset scene point set to obtain the neural network model includes: extracting training features of the two-dimensional sample image through the preset model, and determining training features corresponding to the at least one scene point from the extracted training features; executing the target task according to the training features corresponding to the at least one scene point, and adjusting the parameters of the preset model according to the execution result until the preset training termination condition is reached.
[0019] Based on the above technical solution, the training features of the two-dimensional sample image are extracted through the preset model, and the training features corresponding to the at least one scene point are determined from the extracted training features, thereby realizing the reverse acquisition of the features of the 2D image corresponding to at least one scene point.
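The "adjust parameters until a preset termination condition is reached" pattern can be shown with a deliberately tiny stand-in. The sketch fits a single weight by gradient descent on a squared error and stops on either a loss threshold or a step limit. The model, the loss, and both stopping criteria are assumptions chosen for brevity; they are not the preset model or the target task of the source.

```python
def train_until_done(samples, learning_rate=0.1, max_steps=1000, loss_tolerance=1e-8):
    """Fit y = w * x by gradient descent.

    Stops when the mean squared error drops below loss_tolerance or after
    max_steps updates (both are illustrative termination conditions).
    Returns the learned weight and the last computed loss.
    """
    w = 0.0
    loss = float("inf")
    for _ in range(max_steps):
        # "Execute the task" with the current parameter and measure the error.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < loss_tolerance:
            break
        # Adjust the parameter according to the result.
        grad = sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * grad
    return w, loss

weight, final_loss = train_until_done([(1.0, 2.0), (2.0, 4.0)])
```

In the setting of the source, the forward step would be feature extraction plus the target task, and the parameters would be those of the preset model.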
[0020] According to the sixth possible implementation of the first aspect, in the seventh possible implementation of the first aspect, the step of extracting training features of the two-dimensional sample image through the preset model and determining training features corresponding to the at least one scene point from the extracted training features includes: obtaining each scene point in the preset scene point set that is located on the same ray as the at least one scene point; extracting training features of the two-dimensional sample image through the preset model, and determining training features corresponding to each scene point from the extracted training features based on an attention mechanism.
[0021] Based on the above technical solution, for preset scene points on the same ray, the attention mechanism is used to help the preset model learn a more accurate 3D scene representation.
[0022] According to the first aspect or the various possible implementations of the first aspect described above, in the eighth possible implementation of the first aspect, the step of executing the target task based on the features corresponding to the at least one scene point includes: transforming the at least one scene point into a Cartesian coordinate system to obtain the coordinates of the at least one scene point in the Cartesian coordinate system; and executing the target task based on the features corresponding to the at least one scene point and the coordinates of the at least one scene point in the Cartesian coordinate system.
[0023] Based on the above technical solution, the 3D scene representation defined in polar coordinates is transformed into the Cartesian coordinate system in order to execute subsequent downstream tasks.
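The coordinate change mentioned above is the standard polar-to-Cartesian conversion. In the sketch below, the axis convention and the optional `height` argument are assumptions; the source only states that scene points are transformed into a Cartesian coordinate system.

```python
import math

def polar_to_cartesian(radius, angle, height=0.0):
    """Convert a polar scene point (radius, angle in radians) to (x, y, z).

    The pole maps to the origin; height is passed through unchanged as z.
    """
    x = radius * math.cos(angle)
    y = radius * math.sin(angle)
    return (x, y, height)

ahead = polar_to_cartesian(2.0, 0.0)
side = polar_to_cartesian(1.0, math.pi / 2)
```

Downstream heads that expect a regular Cartesian grid can then consume the features together with these coordinates.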
[0024] According to the first aspect or the various possible implementations of the first aspect described above, in the ninth possible implementation of the first aspect, the target task includes one or more of image classification, semantic segmentation, or object detection.
[0025] Based on the above technical solution, the 3D scene representation can be applied to a single downstream task or to multiple downstream tasks simultaneously.
[0026] In a second aspect, embodiments of this application provide an image processing apparatus, the apparatus comprising: an acquisition module, configured to acquire a two-dimensional image acquired by a first image acquisition device; the first image acquisition device being any image acquisition device installed on a vehicle; a feature determination module, configured to extract features from the two-dimensional image using a neural network model, and determine features corresponding to at least one scene point from the extracted features; wherein the scene point is a preset scene point in a preset scene point set under a bird's-eye view (BEV) perspective, the preset scene point set being distributed in a polar coordinate system with the vehicle as the pole, and the plane containing the preset scene point set being parallel to the ground; the neural network model being trained using training data corresponding to a target task; and an execution module, configured to execute the target task based on the features corresponding to the at least one scene point.
[0027] Based on the above technical solution, a unified 3D scene near the vehicle from the BEV perspective is modeled in polar coordinates, which is more consistent with the pinhole camera model. Features of the 2D image are extracted through a neural network model, and features corresponding to at least one scene point are determined from the extracted 2D image features. This allows the features of the 2D image required for the preset scene points distributed in polar coordinates to be obtained in reverse. This transforms 2D images from different coordinate systems into a unified, accurate, and dense 3D scene representation from the BEV perspective, avoiding the error accumulation caused by depth estimation and the suboptimal results caused by the lack of geometric constraints in implicit projection. Furthermore, based on the features corresponding to at least one scene point, a target task can be executed. In some examples, there can be multiple target tasks, thus enabling the unified, accurate, and dense 3D scene representation to be applied to multiple target tasks simultaneously.
[0028] According to the second aspect, in a first possible implementation of the second aspect, the at least one scene point includes preset scene points located on the same ray in the preset scene point set, the ray having the pole as its endpoint; the feature determination module is further configured to extract features of the two-dimensional image through the neural network model, and determine the features corresponding to the at least one scene point from the extracted features based on an attention mechanism.
[0029] Based on the above technical solution, considering that the probability of an object appearing at a certain angle is relatively high, that is, the probability of a preset scene point on the same ray corresponding to the features of the same object is relatively high, an adaptive attention mechanism is applied to the preset scene points on the same ray. That is, by using an adaptive attention mechanism to constrain the preset scene points on the same ray, the relationship between the preset scene points on the same ray is calculated, thereby better suppressing erroneous 3D scene information, more accurately determining the features of the 2D image corresponding to the preset scene point, and helping to make the obtained 3D scene expression more accurate.
[0030] According to the second aspect or the first possible implementation of the second aspect, in the second possible implementation of the second aspect, the feature determination module is further configured to: extract features from the two-dimensional image using the neural network model to obtain an image feature set; wherein the image feature set includes features corresponding to multiple locations on the two-dimensional image; determine the three-dimensional coordinates corresponding to the at least one scene point using the neural network model; map the three-dimensional coordinates to the coordinate system of the image acquisition device according to the three-dimensional coordinates and the calibration information of the first image acquisition device to determine the target position corresponding to the three-dimensional coordinates among the multiple locations; and obtain the features corresponding to the at least one scene point according to the features corresponding to the target position in the image feature set.
[0031] Based on the above technical solution, by utilizing the one-to-one projection relationship determined by the predefined preset scene points and the calibration information of the image acquisition device, the 2D semantic information on the two-dimensional images acquired by the image acquisition devices located in different coordinate systems is filled into the preset scene points, thereby realizing the transformation of 2D images in different coordinate systems into a unified, accurate, and dense 3D scene expression under the BEV perspective.
[0032] According to the second possible implementation of the second aspect, in the third possible implementation of the second aspect, the feature determination module is further configured to: based on the features corresponding to the target position in the image feature set, repeatedly execute the determination of the three-dimensional coordinates corresponding to the at least one scene point and subsequent operations based on an attention mechanism until a preset number of loops is reached; and obtain the features corresponding to the at least one scene point based on the features corresponding to the target position when the preset number of loops is reached.
[0033] Based on the above technical solution, by utilizing the one-to-one projection relationship determined by the predefined preset scene points and the calibration information of the image acquisition device, the preset scene points are accurately projected onto the specific positions of the two-dimensional image. At the same time, based on the adaptive attention mechanism of polar coordinates, after multiple layers of iterative encoding (i.e., after a preset number of loops), the 2D semantic information corresponding to the preset scene points is accurately obtained. The 2D semantic information on the two-dimensional images acquired by the image acquisition devices located in different coordinate systems is filled into the preset scene points, thereby realizing the transformation of 2D images in different coordinate systems into a unified, accurate, and dense 3D scene expression from the BEV perspective.
[0034] According to the second aspect or the various possible implementations of the second aspect described above, in the fourth possible implementation of the second aspect, the preset scene points in the preset scene point set are evenly distributed in the polar coordinate system.
[0035] Based on the above technical solution, by predefining preset scene points uniformly distributed in polar coordinates, compared with depth estimation and implicit projection, the performance loss caused by pixel-level depth prediction and inconsistent projection relationships can be avoided, which helps to obtain a more accurate 3D scene expression.
[0036] According to the second aspect or the various possible implementations of the second aspect described above, in the fifth possible implementation of the second aspect, the apparatus further includes: a training module, used to acquire training data corresponding to the target task; the training data includes two-dimensional sample images acquired by at least one image acquisition device of the vehicle; the training module is further used to train a preset model using the training data and the preset scene point set to obtain the neural network model.
[0037] Based on the above technical solution, by pre-defining preset scene points distributed in polar coordinates to learn the 3D scene representation of the vehicle, the semantic information obtained by the trained neural network model is more accurate, and an accurate 3D scene representation can be learned without a depth prediction network. In addition, the trained neural network model can transform multiple 2D images into a unified, accurate, and dense 3D scene representation from the BEV perspective. This solves the error and sparsity problems in 3D scene representation that may be caused by depth estimation and implicit projection methods, and the generated 3D scene representation can be used simultaneously for multiple downstream autonomous driving tasks such as 3D object detection and BEV semantic segmentation.
[0038] According to the fifth possible implementation of the second aspect, in the sixth possible implementation of the second aspect, the training module is further configured to: extract training features of the two-dimensional sample image through the preset model, and determine the training features corresponding to the at least one scene point from the extracted training features; execute the target task according to the training features corresponding to the at least one scene point, and adjust the parameters of the preset model according to the execution result until the preset training termination condition is reached.
[0039] Based on the above technical solution, the training features of the two-dimensional sample image are extracted through the preset model, and the training features corresponding to the at least one scene point are determined from the extracted training features, thereby realizing the reverse acquisition of the features of the 2D image corresponding to at least one scene point.
[0040] According to the sixth possible implementation of the second aspect, in the seventh possible implementation of the second aspect, the training module is further configured to: obtain each scene point in the preset scene point set that is located on the same ray as the at least one scene point; extract training features of the two-dimensional sample image through the preset model, and determine the training features corresponding to each scene point from the extracted training features based on the attention mechanism.
[0041] Based on the above technical solution, for preset scene points on the same ray, the attention mechanism is used to help the preset model learn a more accurate 3D scene representation.
[0042] According to the second aspect or the various possible implementations of the second aspect described above, in the eighth possible implementation of the second aspect, the execution module is further configured to: transform the at least one scene point into a Cartesian coordinate system to obtain the coordinates of the at least one scene point in the Cartesian coordinate system; and execute the target task according to the features corresponding to the at least one scene point and the coordinates of the at least one scene point in the Cartesian coordinate system.
[0043] Based on the above technical solution, the 3D scene representation defined in polar coordinates is transformed into the Cartesian coordinate system in order to execute subsequent downstream tasks.
[0044] According to the second aspect or the various possible implementations of the second aspect described above, in the ninth possible implementation of the second aspect, the target task includes one or more of image classification, semantic segmentation, or object detection.
[0045] Based on the above technical solution, the 3D scene representation can be applied to a single downstream task or to multiple downstream tasks simultaneously.
[0046] In a third aspect, embodiments of this application provide an image processing apparatus, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement one or more of the image processing methods of the first aspect when executing the instructions.
[0047] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement one or more of the image processing methods of the first aspect.
[0048] In a fifth aspect, embodiments of this application provide a computer program product that, when run on a computer, causes the computer to perform one or more of the image processing methods described in the first aspect.
[0049] For the technical effects of the third to fifth aspects mentioned above, refer to the first or second aspect mentioned above.
Description of the Drawings
[0050] The accompanying drawings, which are included in and form part of this specification, illustrate exemplary embodiments, features, and aspects of this application together with the specification and serve to explain the principles of this application.
[0051] Figure 1 is a schematic diagram of depth estimation according to an embodiment of this application.
[0052] Figure 2 is a schematic diagram of implicit projection according to an embodiment of this application.
[0053] Figure 3 is a schematic diagram of the conversion of a 2D image to 3D space according to an embodiment of this application.
[0054] Figure 4 is a schematic diagram of the architecture of an autonomous driving system according to an embodiment of this application.
[0055] Figure 5 is a flowchart of an image processing method according to an embodiment of this application.
[0056] Figure 6 is a schematic diagram of a predefined distribution of scene points in polar coordinates according to an embodiment of this application.
[0057] Figure 7 is a schematic diagram of a 3D object detection task according to an embodiment of this application.
[0058] Figure 8 is a schematic diagram of a BEV semantic segmentation task according to an embodiment of this application.
[0059] Figure 9 is a flowchart of an image processing method according to an embodiment of this application.
[0060] Figure 10 is a schematic diagram of an image processing procedure according to an embodiment of this application.
[0061] Figure 11 is a flowchart of an image processing method according to an embodiment of this application.
[0062] Figure 12 is a schematic diagram of a model training process according to an embodiment of this application.
[0063] Figure 13 is a block diagram of an image processing apparatus according to an embodiment of this application.
[0064] Figure 14 is a schematic diagram of the structure of an image processing apparatus according to an embodiment of this application.
Detailed Description
[0065] Various exemplary embodiments, features, and aspects of this application will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.
[0066] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0067] In this application, "at least one" means one or more, and "more than one" means two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can mean: A alone, A and B simultaneously, and B alone, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.
[0068] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0069] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.
[0070] Furthermore, to better illustrate this application, numerous specific details are provided in the following detailed embodiments. Those skilled in the art should understand that this application can be implemented even without certain specific details.
[0071] To better understand the solutions of the embodiments of this application, the relevant terms and concepts that may be involved in the embodiments of this application will be introduced below.
[0072] 1. Neural Network Model
[0073] Also known as a neural network, a neural network model can be composed of neural units. A neural unit can be a computational unit that takes $x_s$ and an intercept of 1 as input, and the output of the computational unit can be: $f\left(\sum_{s=1}^{n} W_s x_s + b\right)$, where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a ReLU function, etc. A neural network is a network formed by connecting many of the above-mentioned individual neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be a region composed of several neural units.
[0074] The work of each layer in a neural network can be described by the mathematical expression y = a(Wx + b). From a physical perspective, the work of each layer in a neural network can be understood as transforming the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of a matrix) through five operations on the input space. These five operations include: 1. Dimensionality increase / decrease; 2. Magnification / scaling; 3. Rotation; 4. Translation; 5. "Bending". Operations 1, 2, and 3 are performed by Wx, operation 4 by +b, and operation 5 by a(). The term "space" is used here because the objects being classified are not individual things, but a class of things, and space refers to the set of all individuals of this class of things. Here, W is the weight vector, and each value in this vector represents the weight value of a neuron in that layer of the neural network. This vector W determines the spatial transformation from the input space to the output space mentioned above; that is, the weights W of each layer control how the space is transformed. The purpose of training a neural network is to ultimately obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially about learning how to control spatial transformations, more specifically, learning the weight matrix. Neural network models can include multi-layer perceptrons (MLPs), deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and so on.
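As a minimal illustrative sketch of the expression y = a(Wx + b), the following Python code computes one layer with a ReLU activation. The function names and the numeric values are assumptions chosen for illustration and do not come from this application.

```python
from typing import List


def relu(z: float) -> float:
    """ReLU activation: keeps positive values and clamps negatives to zero."""
    return z if z > 0.0 else 0.0


def neuron(x: List[float], w: List[float], b: float) -> float:
    """Single neural unit: f(sum_s W_s * x_s + b) with f = ReLU."""
    return relu(sum(ws * xs for ws, xs in zip(w, x)) + b)


def layer(x: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """One layer y = a(Wx + b); each row of W holds the weights of one neuron."""
    return [neuron(x, row, bias) for row, bias in zip(W, b)]


# Illustrative values only (hypothetical weights and input).
x = [1.0, -2.0]
W = [[0.5, 0.25], [-1.0, 1.0]]
b = [0.1, 0.0]
y = layer(x, W, b)
```

Here Wx performs the scaling and rotation of the input, +b performs the translation, and the activation performs the "bending".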
[0075] 2. Convolutional Neural Networks
[0076] A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A CNN contains a feature extractor consisting of convolutional layers and subsampling layers. This feature extractor can be viewed as a filter, and the convolution process can be seen as performing convolution between the same trainable filter and an input image or a convolutional feature map. A convolutional layer is a layer of neurons in a CNN that performs convolutional processing on the input signal. In a convolutional layer of a CNN, a neuron may only be connected to some of its neighboring neurons. A convolutional layer typically contains several feature maps, each composed of rectangularly arranged neural units. Neural units within the same feature map share weights, which are the convolutional kernel. Shared weights can be understood as the way image information is extracted regardless of location. The underlying principle is that the statistical information of one part of an image is the same as that of other parts. This means that image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all locations in the image. Within the same convolutional layer, multiple convolutional kernels can be used to extract different image information. Generally, the more kernels there are, the richer the image information reflected by the convolution operation. Convolutional kernels can be initialized as matrices of random size, and during the training of the convolutional neural network, they can learn appropriate weights. Furthermore, sharing weights directly reduces the connections between layers in the convolutional neural network, while also lowering the risk of overfitting.
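The weight sharing described above can be sketched as follows, assuming a single-channel image, a stride of 1, and no padding. As in most deep learning frameworks, the kernel is applied without flipping (cross-correlation). The function name and the sample values are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out: Matrix = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every location (i, j).
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Illustrative 3x3 image and 2x2 kernel (hypothetical values).
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0], [0.0, -1.0]]
feature_map = conv2d_valid(image, kernel)
```

Because every output location reuses the same four kernel weights, the layer has far fewer parameters than a fully connected layer of the same input and output size.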
[0077] 3. Backbone Network
[0078] The basic neural network structure for feature extraction from input images.
[0079] 4. Linear layer
[0080] A neural network layer that implements linear combination or linear transformation of inputs.
[0081] 5. Semantic segmentation
[0082] The process of linking each pixel in an image to a class label.
[0083] 6. Bird's-eye view (BEV) semantic segmentation
[0084] Semantic segmentation of dynamic or static areas can be performed from a bird's-eye view (BEV) perspective. For example, in autonomous driving scenarios, semantic segmentation can be performed on static areas such as drivable areas, lane lines, sidewalks, and crosswalks.
[0085] 7. Attention Mechanism
[0086] Attention mechanisms can quickly extract important features from sparse data. They provide an effective modeling approach for capturing global contextual information through QKV (Queries, Keys, Values). Assuming the input is Q (query) and the context is stored as key-value pairs (K, V), the attention mechanism is essentially a mapping function from the query to a series of key-value pairs (key, value). Attention essentially assigns a weight coefficient to each element in the sequence, which can also be understood as soft addressing. If each element in the sequence is stored in (K, V) form, then attention performs addressing by calculating the similarity between Q and K. The similarity calculated between Q and K reflects the importance of the corresponding V values, i.e., the weights, and a weighted sum of the V values then gives the final feature value.
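The soft addressing described above can be illustrated with a minimal sketch. It assumes that similarity is measured by a dot product and that the weights are normalized with softmax; both are common choices rather than requirements stated here, and all names are illustrative.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float], keys: List[List[float]],
           values: List[List[float]]) -> List[float]:
    """Weight each value by the similarity between the query and its key."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# The query matches the first key more closely, so the first value dominates.
out = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```

The output is a blend of all values rather than a single retrieved value, which is why the operation is described as soft addressing.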
[0087] 8. 3D representation
[0088] In other words, 3D scene representation. For example, in an autonomous driving scenario, a 3D scene can be modeled within the range perceived by the sensors installed on the vehicle, centered on the vehicle, and the scene can be represented in a certain form.
[0089] In related technologies, explicit depth estimation or implicit projection methods are mainly used to convert 2D images into 3D space:
[0090] (1) Depth prediction:
[0091] Figure 1 shows a schematic diagram of depth estimation according to an embodiment of this application. As shown in Figure 1, by using the predicted depth of each pixel of the image in the 2D coordinate system of the vehicle sensor, together with the intrinsic and extrinsic parameter matrices of the camera, the pixels in the 2D image are lifted to 3D coordinates, yielding vehicle-centered scene features and thus transforming the 2D image into 3D space. Because depth estimation in unconstrained scenes is error-prone, and the error propagates to subsequent processing and affects the final result, this type of method suffers from an error propagation problem that is difficult to avoid.
[0092] As an example, an explicit depth prediction network can be used to lift each pixel in a 2D image to a 3D coordinate point, thereby converting the 2D image into a 3D scene representation from the BEV perspective. For surround-view images input from multiple cameras, the known camera intrinsic and extrinsic parameter matrices are used to transform them into the same 3D coordinate system, enabling autonomous-driving-related dynamic object segmentation and static road segmentation tasks from the BEV perspective. This approach requires an additional depth prediction network, and its performance is poor because the significant depth estimation error accumulates and propagates to subsequent processing. Furthermore, this method is optimized only for BEV segmentation tasks and is poor at identifying and locating small objects.
[0093] (2) Implicit projection:
[0094] Figure 2 shows a schematic diagram of implicit projection according to an embodiment of this application. As shown in Figure 2, an implicit projection method directly transforms the 2D image in the 2D coordinate system of the vehicle sensor to 3D space, obtaining vehicle-centered 3D scene features. Because this method does not use the intrinsic and extrinsic parameter matrices in the projection process, the resulting 3D representation is structurally inconsistent with the corresponding 2D image. That is, there is no strict one-to-one correspondence across coordinate systems between pixels in the 2D image and points in 3D space, leading to poor performance and significant errors.
[0095] As an example, this method transforms the 2D semantics of different layers of an image into 3D representations at different distances from the BEV perspective through direct projection, and then performs subsequent segmentation tasks from the BEV perspective. For surround-view images input from different cameras, this method makes predictions in different coordinate systems. This method lacks a strict one-to-one correspondence across coordinate systems, resulting in a suboptimal network learning process. Furthermore, image inputs from different coordinate systems are learned and predicted within their own coordinate systems, without being unified into the vehicle's 3D coordinate system; therefore, the method does not effectively utilize global information.
[0096] As another example, the Detection Transformer (DETR) structure from 2D object detection is applied to 3D scenes to perform 3D object detection tasks on objects in the surrounding scene and learn image semantic information encoded by Residual Networks (ResNet). This approach provides a sparse representation of 3D objects in the scene but lacks a dense representation of the 3D scene surrounding the vehicle, resulting in incomplete structural information. Therefore, this approach cannot be effectively applied to dense downstream tasks, such as BEV semantic segmentation.
[0097] Since the aforementioned methods for converting 2D images to 3D space all have significant errors and ignore the transformation relationships between different coordinate systems, this application provides an image processing method (described in detail below). Figure 3 shows a schematic diagram of converting a 2D image to 3D space according to an embodiment of this application. As shown in Figure 3, empty 3D scene points are predefined in a polar-coordinate distribution, and the 2D image features required by these empty 3D scene points are found by tracing back. These 2D image features are then filled into the empty 3D scene points, transforming the 2D image into 3D space and generating a complete, unified, accurate, and dense 3D scene representation centered on the vehicle. Compared to the aforementioned depth estimation and implicit projection methods, the image processing method provided in this application avoids the error accumulation caused by depth estimation and the suboptimal results caused by the lack of geometric constraints in implicit projection.
[0098] For ease of description, the following takes as an example the conversion, in an autonomous driving system, of 2D images collected by the vehicle's sensors into a 3D scene representation from the BEV perspective, in order to illustrate the image processing method provided in this application embodiment. Figure 4 shows a schematic diagram of the architecture of an autonomous driving system according to an embodiment of this application. As shown in Figure 4, the autonomous driving system may include modules such as a perception module, a planning and decision-making module, and a transmission control module.
[0099] The perception module is used to perceive the environment around or inside the vehicle. It can integrate data collected by onboard sensors, such as cameras, LiDAR, millimeter-wave radar, ultrasonic radar, and light sensors, to perceive the environment around or inside the vehicle and transmit the perception results to the planning and decision-making module. For example, the data collected by onboard sensors may include video streams, radar point cloud data, or analyzed structured information or data such as the position, speed, steering angle, and size of people, vehicles, and objects. As an example, the perception module can be configured with a visual perception submodule. The visual perception submodule can acquire images of the vehicle's surrounding environment captured by the onboard camera, and then process the acquired images to detect objects such as pedestrians, lane lines, vehicles, obstacles, and drivable areas. For example, a neural network model can be used to process the 2D images of the vehicle's surrounding environment captured by the onboard camera to perform tasks such as 3D object detection and BEV semantic segmentation. This neural network model can be deployed, for example, in an onboard computing platform or AI accelerator.
[0100] The planning and decision-making module is used to analyze and make decisions based on the perception results generated by the perception module (e.g., 3D target detection results, BEV semantic segmentation results), plan and generate a control set that meets specific constraints (e.g., vehicle dynamics constraints, collision avoidance, passenger comfort, etc.); and can transmit the control set to the transmission control module.
[0101] The transmission control module is used to control the vehicle's movement according to the control set generated by the planning and decision-making module. For example, based on the control set and combined with the vehicle's dynamic information, it can generate control signals such as steering wheel angle, speed, and acceleration, and control the on-board steering system or engine to execute the control signals, thereby controlling the vehicle's movement.
[0102] For example, the autonomous driving system may also include other functional modules; for example, a positioning module, an interaction module, a communication module, etc. (not shown in the figure), which are not limited thereto. The positioning module can be used to provide the vehicle's location information and attitude information. For example, the positioning module may include a Global Navigation Satellite System (GNSS), an Inertial Navigation System (INS), etc., which can be used to determine the vehicle's location information. The interaction module can be used to send information to the driver and receive driver commands. The communication module can be used for the vehicle to communicate with other devices, which may include mobile terminals, cloud devices, other vehicles, roadside devices, etc., and can be implemented through wireless communication connections such as 2G / 3G / 4G / 5G, Bluetooth, frequency modulation (FM), wireless local area networks (WLAN), long-term evolution (LTE), vehicle-to-everything (V2X), vehicle-to-vehicle (V2V), and long-term evolution vehicle (LTE-V).
[0103] The image processing method provided in this application embodiment can be executed by an image processing device; this application embodiment does not limit the type of the image processing device.
[0104] For example, the image processing device can be a standalone device, integrated into other devices, or implemented through software or a combination of software and hardware.
[0105] For example, the image processing device can be an autonomous vehicle or other components within an autonomous vehicle. This image processing device includes, but is not limited to, in-vehicle terminals, in-vehicle controllers, in-vehicle modules, in-vehicle components, in-vehicle chips, in-vehicle units, in-vehicle radar, or in-vehicle cameras, etc. As an example, the image processing device can be integrated into the in-vehicle computing platform or AI accelerator of an autonomous vehicle.
[0106] For example, the image processing device can also be other data processing devices or systems besides autonomous vehicles, or components or chips installed in such devices or systems. For instance, the image processing device can be a cloud server, desktop computer, portable computer, network server, PDA, mobile phone, tablet computer, wireless terminal device, embedded device, or other data processing device, or a component or chip within such devices.
[0107] For example, the image processing device may also be a chip or processor with processing capabilities, and the image processing device may include multiple processors. The processor may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
[0108] It should be noted that the application scenarios described in the embodiments of this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. Those skilled in the art will know that the technical solutions provided by the embodiments of this application are also applicable to similar technical problems in the face of other similar or new scenarios. For example, in addition to autonomous driving scenarios, the image processing method provided by the embodiments of this application can also be applied to any scenario that requires converting 2D input images into 3D representations.
[0109] The image processing method provided in the embodiments of this application will be described in detail below.
[0110] Figure 5 shows a flowchart of an image processing method according to an embodiment of the present application. This method can be executed by the aforementioned image processing apparatus. As shown in Figure 5, the method may include the following steps:
[0111] S501. Acquire a two-dimensional image acquired by the first image acquisition device; the first image acquisition device is any image acquisition device installed on the vehicle.
[0112] For example, multiple image acquisition devices can be installed on the vehicle, with different devices used to acquire 2D images from different directions. For instance, vehicle-mounted cameras (such as pinhole cameras) can be installed at the front, left front, right front, rear, right rear, and left rear of the vehicle to acquire 2D images from the corresponding locations, i.e., 2D images from the frontal view of each image acquisition device, thereby achieving 360° environmental image acquisition around the vehicle. The first image acquisition device can be any one of the aforementioned multiple image acquisition devices. It is understood that different image acquisition devices can have different coordinate systems, meaning that the two-dimensional images acquired by different image acquisition devices can have different coordinate systems.
[0113] For example, the acquired two-dimensional image may include one or more objects in the environment surrounding the vehicle, such as other vehicles, pedestrians, obstacles, trees, traffic signs, buildings, lane lines, etc.
[0114] S502. Extract features from the two-dimensional image using a neural network model, and determine the features corresponding to at least one scene point from the extracted features.
[0115] The scene point is a preset scene point in the preset scene point set under the BEV perspective. The preset scene point set is distributed in a polar coordinate system with the vehicle as the pole, and the plane where the preset scene point set is located is parallel to the ground.
[0116] For example, the 3D scene near the vehicle from the BEV perspective is modeled uniformly in polar coordinates to obtain the preset scene point set. As an example, the origin of the vehicle coordinate system (also known as the vehicle body coordinate system) can be used as the pole, and a ray parallel to the ground can be drawn from this pole as the polar axis. A polar coordinate system can be pre-established on the plane containing this ray; multiple preset scene points can then be predefined in this polar coordinate system to obtain the preset scene point set from the BEV perspective. The number of preset scene points in the set and the position of each preset scene point can be set as required, and this embodiment does not limit this. Presetting scene points in polar coordinates in this way is more consistent with the pinhole camera model.
[0117] For example, the preset scene points in the preset scene point set are uniformly distributed in the polar coordinate system. Compared to depth estimation and implicit projection methods, predefining uniformly distributed scene points in polar coordinates around the vehicle avoids the performance loss caused by pixel-level depth prediction and inconsistent projection relationships. For instance, locations near the pole, i.e., close to the vehicle, have a greater impact on the vehicle's movement, so a larger number of preset scene points can be distributed there through regularization. Conversely, locations farther from the pole have a relatively smaller impact on the vehicle's movement, so fewer preset scene points can be distributed there through regularization. In this way, the preset scene points are uniformly distributed in polar coordinates around the vehicle.
[0118] Figure 6 shows a schematic diagram of preset scene points distributed in polar coordinates according to an embodiment of this application. As shown in Figure 6, a polar coordinate system parallel to the ground is established with the origin of the vehicle coordinate system as the pole. Preset scene points are evenly distributed near the vehicle, forming a polarized grid of preset scene points. As an example, Θ rays with the pole as their endpoint can be evenly distributed on the plane containing the polar coordinate system, with the same included angle between adjacent rays, and multiple points are set as preset scene points at equal intervals on each ray. For example, 36 rays with the pole as their endpoint can be distributed on the plane containing the polar coordinate system, with an included angle of 10 degrees between adjacent rays, and 100 points are set as preset scene points at 1-meter intervals on each ray.
[0119] For example, the position p of each preset scene point in the preset scene point set in the polar coordinate system can be represented by the following formula (1):
[0120] $p = (r, \theta)$ (1)
[0121] Where r represents the radius coordinate in polar coordinates, that is, the distance between the preset scene point and the pole; θ represents the angular coordinate in polar coordinates, that is, the angle between the line segment from the pole to the preset scene point and the polar axis.
[0122] Preset scene points in polar coordinates can be transformed to Cartesian (rectangular) coordinates using the following formula (2):
[0123] $x = r\cos(\theta),\quad y = r\sin(\theta)$ (2)
[0124] Where x represents the abscissa of the Cartesian coordinate system, y represents the ordinate of the Cartesian coordinate system, r represents the radius coordinate in polar coordinates, and θ represents the angular coordinate in polar coordinates.
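As an illustrative sketch, the polarized grid of the example above (36 rays at 10-degree spacing, 100 points per ray at 1-meter intervals) and the conversion of formula (2) can be written as follows. The function names are assumptions, as is the choice to place the first point of each ray one interval away from the pole rather than at the pole itself.

```python
import math
from typing import List, Tuple


def polar_grid(num_rays: int = 36, points_per_ray: int = 100,
               spacing_m: float = 1.0) -> List[Tuple[float, float]]:
    """Return (r, theta) for evenly spaced rays and evenly spaced points per ray.

    Assumption: the k-th point on a ray lies at radius (k + 1) * spacing_m.
    """
    step = 2.0 * math.pi / num_rays  # equal angle between adjacent rays
    return [((k + 1) * spacing_m, i * step)
            for i in range(num_rays)
            for k in range(points_per_ray)]


def polar_to_cartesian(r: float, theta: float) -> Tuple[float, float]:
    """Formula (2): x = r cos(theta), y = r sin(theta)."""
    return r * math.cos(theta), r * math.sin(theta)


grid = polar_grid()
xy = [polar_to_cartesian(r, theta) for r, theta in grid]
```

With these settings the grid contains 36 × 100 = 3600 empty scene points. Because the arc length between adjacent rays grows with r, the points are naturally denser near the vehicle and sparser far from it.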
[0125] It is understandable that preset scene points are predefined and do not have semantic information; therefore, they can also be called empty 3D scene points.
[0126] For example, features of the two-dimensional images acquired by each image acquisition device can be extracted using a neural network model, and features corresponding to multiple preset scene points can be determined from the extracted features. Specifically, the features extracted from the two-dimensional images using the neural network model can characterize the semantic information of the images, thereby allowing the determination of features corresponding to multiple preset scene points from the extracted features, thus enabling the originally empty 3D scene points to possess semantic information. Inspired by ray tracing, the required 2D image semantic information can be obtained from the 3D scene points along the "opposite direction" of light propagation, thereby "filling" the semantic information of multiple 2D images in different coordinate systems into predefined empty 3D scene points with a unified coordinate system.
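One plausible way to picture the tracing back described above is to project a scene point into a camera image and read the feature at the resulting pixel. The sketch below assumes a pinhole camera with known intrinsic parameters (fx, fy, cx, cy) and extrinsic parameters (rotation, translation), an assumed height for the scene point, and nearest-neighbour sampling. None of these details are specified in this paragraph, and the embodiments below determine the features through an attention mechanism, so this is only an illustration of the geometric idea.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_to_pixel(p_vehicle: Vec3, rotation: Sequence[Sequence[float]],
                     translation: Vec3, fx: float, fy: float,
                     cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a vehicle-frame point; None if behind the camera."""
    # Vehicle frame -> camera frame (hypothetical extrinsic parameters).
    xc, yc, zc = (
        sum(rotation[i][j] * p_vehicle[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0.0:
        return None
    # Camera frame -> pixel coordinates (hypothetical intrinsic parameters).
    return fx * xc / zc + cx, fy * yc / zc + cy


def sample_feature(feature_map: List[List[List[float]]],
                   uv: Optional[Tuple[float, float]]) -> Optional[List[float]]:
    """Nearest-neighbour lookup feature_map[row][col]; None if outside the image."""
    if uv is None:
        return None
    col, row = int(round(uv[0])), int(round(uv[1]))
    if 0 <= row < len(feature_map) and 0 <= col < len(feature_map[0]):
        return feature_map[row][col]
    return None


# Toy 3x5 feature map whose feature at (row, col) is [row, col].
feature_map = [[[float(r), float(c)] for c in range(5)] for r in range(3)]
# Forward-facing camera: camera z = vehicle x, camera x = -vehicle y, camera y = -vehicle z.
R = [[0.0, -1.0, 0.0], [0.0, 0.0, -1.0], [1.0, 0.0, 0.0]]
t = (0.0, 0.0, 0.0)
uv = project_to_pixel((10.0, 0.0, 0.0), R, t, 100.0, 100.0, 2.0, 1.0)
feat = sample_feature(feature_map, uv)
```

A scene point that falls behind a camera or outside its image receives no feature from that camera, which is why features from several cameras with different coordinate systems are needed to fill the whole grid.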
[0127] It should be noted that the type and number of neural network models are not limited in the embodiments of this application. For example, they can be deep neural networks, convolutional neural networks, recurrent neural networks, etc.; the neural network model may include one or more neural network models.
[0128] The neural network model is trained using training data corresponding to the target task; that is, the parameters in the neural network model can be pre-trained using relevant training data corresponding to the target task. The training process of the neural network model can be referred to in the following description.
[0129] For example, the target task may include one or more of image classification, semantic segmentation, or object detection. For instance, the target task may be a downstream task related to autonomous driving, such as 3D object detection or BEV semantic segmentation.
[0130] For example, the number of target tasks can be one or more. For instance, the target task can be a 3D object detection task, or it can be a 3D object detection task and a BEV semantic segmentation task. In this way, it can be applied to a single downstream task, or it can be applied to multiple autonomous driving downstream tasks simultaneously.
[0131] S503. Execute the target task based on the features corresponding to the at least one scene point.
[0132] Understandably, based on the features corresponding to each preset scene point, the semantic information of multiple 2D images in different coordinate systems is "filled" into the preset scene points with a unified coordinate system. This results in a dense 3D scene representation defined in polar coordinates from the BEV perspective, that is, the 3D scene around the vehicle is represented in polar coordinates. Then, based on this 3D scene representation, subsequent target tasks are executed, thereby achieving pure visual detection from the BEV perspective.
[0133] For example, a 3D scene representation can be transformed to a Cartesian coordinate system through sampling to perform subsequent downstream tasks. For instance, at least one scene point in the polar coordinate system can be transformed to a Cartesian coordinate system to obtain the corresponding coordinates of the at least one scene point in the Cartesian coordinate system. Based on the features corresponding to the at least one scene point and its corresponding coordinates in the Cartesian coordinate system, a 3D scene representation defined in Cartesian coordinate form is obtained, which can then be used to perform the target task. In this way, a 3D scene representation defined in polar coordinate form can be transformed to a Cartesian coordinate system to perform subsequent downstream tasks, such as 3D object detection tasks and BEV semantic segmentation tasks.
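The transformation through sampling can be sketched as a nearest-neighbour lookup: for each cell of a Cartesian grid, the polar coordinates of the cell are computed and the feature of the closest preset scene point is copied. The grid layout, the nearest-neighbour rule, and all names below are illustrative assumptions; interpolation could be used instead.

```python
import math
from typing import Any, List, Optional


def sample_polar_to_cartesian(polar_feats: List[List[Any]], spacing_m: float,
                              xs: List[float], ys: List[float]
                              ) -> List[List[Optional[Any]]]:
    """Resample polar features onto Cartesian positions (xs, ys).

    Assumption: polar_feats[ray][k] is the feature at angle
    ray * 2*pi / num_rays and radius (k + 1) * spacing_m.
    """
    num_rays = len(polar_feats)
    points_per_ray = len(polar_feats[0])
    step = 2.0 * math.pi / num_rays
    out: List[List[Optional[Any]]] = []
    for y in ys:
        row: List[Optional[Any]] = []
        for x in xs:
            r = math.hypot(x, y)
            theta = math.atan2(y, x) % (2.0 * math.pi)  # map to [0, 2*pi)
            ray = int(round(theta / step)) % num_rays   # nearest ray
            k = int(round(r / spacing_m)) - 1           # nearest radius index
            inside = 0 <= k < points_per_ray
            row.append(polar_feats[ray][k] if inside else None)
        out.append(row)
    return out


# Toy polar grid: 4 rays, 3 points per ray; each feature records (ray, k).
polar_feats = [[(ray, k) for k in range(3)] for ray in range(4)]
bev = sample_polar_to_cartesian(polar_feats, 1.0, [-1.0, 0.0, 2.0], [0.0, 3.0])
```

Cartesian cells that lie at the pole or beyond the outermost preset scene point have no nearby polar sample and are left empty in this sketch.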
[0134] As an example, the target task could be a 3D object detection task within the visual perception submodule of an autonomous driving system. The 3D object detection task aims to detect dynamic objects in the scene near the vehicle. Figure 7 shows a schematic diagram of a 3D object detection task according to an embodiment of the present application. As shown in Figure 7, targets such as vehicles, pedestrians, and traffic signs are detected in the 2D images acquired by the image acquisition devices installed at the front, left front, right front, rear, left rear, and right rear of the vehicle, providing road condition information for the vehicle system. Compared to explicit depth prediction and implicit direct projection methods, which have poor learning capabilities, the image processing method described in this application transforms 2D images into 3D space, accurately obtains the 2D semantic information corresponding to the preset scene points, and provides an accurate 3D scene representation for downstream 3D object detection.
[0135] As another example, the target task can be a BEV semantic segmentation task, that is, a semantic segmentation task from the BEV perspective. Figure 8 shows a schematic diagram of a BEV semantic segmentation task according to an embodiment of this application. As shown in Figure 8, the BEV semantic segmentation task predicts static road information from the BEV perspective, including one or more of the following: drivable areas, lane lines, sidewalks, or crosswalks. It can also segment dynamic objects relevant to autonomous driving from the BEV perspective, such as other vehicles and pedestrians. The irregular rectangles in Figure 8 represent vehicles segmented from the BEV perspective, i.e., the projection of the vehicle 3D object detection results onto the BEV perspective. Using the image processing method described in this application, the 2D image input is converted into a 3D feature representation from the BEV perspective. The resulting 3D scene representation can be used for BEV semantic segmentation tasks, such as predicting static road information.
[0136] In this embodiment, the 3D scene near the vehicle from the BEV perspective is uniformly modeled in polar coordinates, which is more consistent with the pinhole camera model. Features of the 2D image are extracted through a neural network model, and features corresponding to at least one scene point are determined from the extracted 2D image features. This allows the reverse acquisition of the features of the 2D image required for the predefined scene points distributed in polar coordinates, thereby transforming 2D images of different coordinate systems into a unified, accurate, and dense 3D scene representation from the BEV perspective. This avoids the error accumulation caused by depth estimation methods and the suboptimal results caused by the lack of geometric constraints in implicit projection methods. Furthermore, based on the features corresponding to at least one scene point, a target task can be executed. In some examples, there can be multiple target tasks, thus applying the unified, accurate, and dense 3D scene representation to multiple target tasks simultaneously.
[0137] Furthermore, the features corresponding to at least one scene point determined in step S502 above can be described through an attention mechanism.
[0138] In one possible implementation, the at least one scene point includes preset scene points located on the same ray in the preset scene point set, the ray having the pole of the polar coordinate system as its endpoint. In this case, step S502 above, in which features of the two-dimensional image are extracted through a neural network model and the features corresponding to the at least one scene point are determined from the extracted features, may include: extracting features of the two-dimensional image through the neural network model, and determining the features corresponding to the at least one scene point from the extracted features based on an attention mechanism.
[0139] For example, the attention mechanism may include a deformable attention mechanism and / or an adaptive polar coordinate attention mechanism. Taking the aforementioned pre-established polar coordinate system with Θ angles (i.e., Θ rays) as an example, the adaptive attention mechanism can be applied once to the preset scene points on each ray, which completes the polar feature optimization operation over the entire 3D scene defined in the polar coordinate system and thereby determines the features corresponding to the at least one scene point from the extracted features more accurately.
[0140] Taking one ray as an example, epipolar feature optimization is performed on the preset scene points on that ray. Assume there are R preset scene points on the same ray, and each preset scene point corresponds to a feature vector that contains the corresponding 2D image features. The linear transformation performed by a linear layer in the neural network model, that is, a fully connected (fc) layer, can be defined by the following formula (3):
[0141] $$\mathrm{fc}(q) = qW + b \tag{3}$$
[0142] Where q is the input vector, W and b are the linear layer parameters, and fc(q) is the output after linear transformation. For R feature vectors corresponding to R preset scene points on a ray, the following formula (4) is used to transform each feature vector through three different linear layers into three feature vectors of the same size, which are defined as Q, K, and V respectively:
[0143] $$Q = \mathrm{fc}_1(q), \quad K = \mathrm{fc}_2(q), \quad V = \mathrm{fc}_3(q) \tag{4}$$
[0144] Where q is the input vector, that is, any one of the R feature vectors, and fc1(q), fc2(q), and fc3(q) are the outputs after three different linear transformations.
[0145] For all preset scene points on the same ray, the formula for the adaptive attention mechanism can be shown in the following formula (5):
[0146]
[0147] Where θ represents the angular coordinate of the ray in the polar coordinate system, and d_K represents the dimension of the feature vector K.
[0148] This completes the operation of applying an adaptive attention mechanism to all preset scene points on a single ray. Similarly, an adaptive attention mechanism can be applied to all preset scene points on all Θ rays in the polar coordinate system to complete the epipolar feature optimization operation for the entire polar coordinate system-defined 3D scene.
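The per-ray computation described by formulas (3) to (5) can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: it assumes that formula (5) takes the standard scaled dot-product form softmax(QK^T / sqrt(d_K))V, and the function names (`fc`, `softmax`, `ray_attention`) and the toy weights are hypothetical.

```python
import math

def fc(q, W, b):
    """Linear layer of formula (3): fc(q) = qW + b, with q a row vector."""
    return [sum(q[i] * W[i][j] for i in range(len(q))) + b[j]
            for j in range(len(b))]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def ray_attention(features, params):
    """Attend among the R preset scene points on one ray.

    features: list of R feature vectors (one per scene point).
    params: three (W, b) pairs for the Q, K and V linear layers.
    Assumes standard scaled dot-product attention for formula (5).
    """
    (Wq, bq), (Wk, bk), (Wv, bv) = params
    Q = [fc(f, Wq, bq) for f in features]
    K = [fc(f, Wk, bk) for f in features]
    V = [fc(f, Wv, bv) for f in features]
    d_k = len(K[0])
    refined = []
    for q in Q:
        scores = [sum(a * c for a, c in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        refined.append([sum(weights[r] * V[r][j] for r in range(len(V)))
                        for j in range(len(V[0]))])
    return refined

# Toy usage: R = 3 scene points on one ray, 2-dimensional features,
# identity linear layers (hypothetical values).
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
ray = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = ray_attention(ray, (identity, identity, identity))
```

Applying `ray_attention` to each of the Θ rays in turn would cover the whole polar grid; each refined feature is a weighted mixture of the features on the same ray only.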
[0149] In this embodiment, considering that the probability of an object appearing at a certain angle is relatively high, that is, the probability of a preset scene point on the same ray corresponding to the features of the same object is relatively high, an adaptive attention mechanism is applied to the preset scene points on the same ray. That is, by using an adaptive attention mechanism to constrain the preset scene points on the same ray, the relationship between the preset scene points on the same ray is calculated, thereby better suppressing erroneous 3D scene information, more accurately determining the features of the 2D image corresponding to the preset scene point, and helping to make the obtained 3D scene expression more accurate.
[0150] It should be noted that the above embodiments take the implementation of the adaptive attention mechanism at preset scene points on the same ray in polar coordinates as an example; based on the concept of the embodiments of this application, for other coordinate systems, the adaptive attention mechanism can be applied to features on the same coordinate axis, thereby improving performance.
[0151] The following is an exemplary description of a possible implementation method for extracting features of the two-dimensional image through a neural network model in step S502 above, and determining the features corresponding to at least one scene point from the extracted features.
[0152] Figure 9 shows a flowchart of an image processing method according to an embodiment of this application. As shown in Figure 9, the method may include the following steps:
[0153] S5021. Using the neural network model, feature extraction is performed on the two-dimensional image to obtain an image feature set; wherein, the image feature set includes features corresponding to multiple locations on the two-dimensional image.
[0154] As an example, a neural network model may include a backbone network, which can be used to extract features from two-dimensional images acquired by multiple image acquisition devices to obtain an image feature set. Exemplarily, the backbone network can be a convolutional neural network, a graph convolutional network, a recurrent neural network, or other networks with image feature extraction capabilities; this is not limited. As another example, the backbone network can be a residual network equipped with deformable convolutions.
[0155] For example, the image feature set may include features corresponding to multiple locations in multiple two-dimensional images, that is, features corresponding to multiple pixels in multiple two-dimensional images; for example, the image feature set may include multi-scale feature maps extracted from the second, third and fourth stages of the backbone network.
[0156] S5022. Determine the three-dimensional coordinates corresponding to the at least one scene point through the neural network model.
[0157] Among them, the values of x and y in the three-dimensional coordinates (x, y, z) can be determined by referring to the above formula (2), and the value of z can be determined by the neural network model, so as to obtain the three-dimensional coordinates corresponding to each preset scene point.
[0158] S5023. Based on the three-dimensional coordinates and the calibration information of the first image acquisition device, the three-dimensional coordinates are mapped to the coordinate system of the image acquisition device to determine the target position corresponding to the three-dimensional coordinates among the plurality of positions.
[0159] For example, the calibration information may include the intrinsic parameter matrix and extrinsic parameter matrix of the first image acquisition device. The intrinsic and extrinsic parameter matrices may be pre-calibrated and stored in the image processing device, and when acquiring the intrinsic and extrinsic parameter matrices of the image acquisition device, the image processing device may directly read these matrices from its local storage; alternatively, the intrinsic and extrinsic parameter matrices of the image acquisition device may also be pre-calibrated and stored in the image acquisition device, and the image processing device may request these matrices from the image acquisition device.
[0160] For example, the number of target locations can be one or more.
[0161] As an example, the three-dimensional coordinates can be mapped to the coordinate system of the image acquisition device based on the projection relationship defined by the intrinsic and extrinsic parameter matrices of the image acquisition device, that is, the three-dimensional coordinates can be mapped to the coordinate system of the two-dimensional image acquired by the image acquisition device. In this way, the target position corresponding to the three-dimensional coordinates among multiple positions on the two-dimensional image can be determined, and the specific position of the at least one scene point on the two-dimensional image can be determined.
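The mapping in steps S5022 and S5023 can be sketched as follows. This is a minimal illustration under stated assumptions: x and y are obtained with the usual polar-to-Cartesian conversion, the projection is the standard pinhole model, and the frame conventions, the calibration values, and the helper names (`polar_to_xy`, `matvec`, `project_point`) are hypothetical. The height z, which the embodiment obtains from the neural network model, is simply set to 0 here.

```python
import math

def polar_to_xy(r, theta):
    """Assumed conversion of a polar scene point (vehicle at the pole) to x, y."""
    return r * math.cos(theta), r * math.sin(theta)

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def project_point(p_vehicle, extrinsic, intrinsic):
    """Project a 3D point in the vehicle frame to pixel coordinates.

    extrinsic: 3x4 [R | t] matrix mapping the vehicle frame to the camera frame.
    intrinsic: 3x3 camera matrix.
    Returns (u, v), or None if the point lies behind the camera.
    """
    x, y, z = p_vehicle
    cam = matvec(extrinsic, [x, y, z, 1.0])
    if cam[2] <= 0.0:
        return None
    u, v, w = matvec(intrinsic, cam)
    return u / w, v / w

# Hypothetical calibration: a camera looking along the vehicle's +x axis
# (vehicle frame: x forward, y left, z up; camera frame: x right, y down, z forward).
EXTRINSIC = [[0.0, -1.0, 0.0, 0.0],
             [0.0, 0.0, -1.0, 0.0],
             [1.0, 0.0, 0.0, 0.0]]
INTRINSIC = [[1000.0, 0.0, 800.0],
             [0.0, 1000.0, 450.0],
             [0.0, 0.0, 1.0]]

x, y = polar_to_xy(10.0, 0.0)  # scene point 10 m straight ahead
pixel = project_point((x, y, 0.0), EXTRINSIC, INTRINSIC)  # (800.0, 450.0)
```

With these toy values the scene point straight ahead lands on the principal point, while a scene point behind the vehicle returns `None` for this camera and would have to be looked up in another camera's image.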
[0162] S5024. Based on the features corresponding to the target location in the image feature set, obtain the features corresponding to the at least one scene point.
[0163] For example, the target location can correspond to one or more features, that is, the number of features corresponding to each preset scene point can be one or more. For example, Figure 6 above shows the features corresponding to the preset scene points eye1, eye2, and eye3.
[0164] As an example, the neural network model may include a decoding layer. The decoding layer executes the above steps S5022-S5024. It can use the features extracted by the backbone network to determine the three-dimensional coordinates corresponding to each preset scene point. Based on the three-dimensional coordinates corresponding to each preset scene point and the calibration information of each image acquisition device, it maps the three-dimensional coordinates corresponding to each preset scene point to the coordinate system of each image acquisition device, determines the target position corresponding to the three-dimensional coordinates of each preset scene point in the two-dimensional image, and thus fills the features corresponding to the target position in the image feature set into the corresponding preset scene points to obtain the features corresponding to each preset scene point.
[0165] In this way, by using the one-to-one projection relationship determined by the predefined preset scene points and the calibration information of the image acquisition device, the 2D semantic information on the two-dimensional images acquired by the image acquisition devices located in different coordinate systems is filled into the preset scene points, thereby realizing the transformation of 2D images in different coordinate systems into a unified, accurate, and dense 3D scene expression under the BEV perspective.
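The filling operation of step S5024 can be sketched as a lookup into a feature map at the projected target position. This is a minimal sketch: the text above does not state how a fractional position is read, so bilinear interpolation is an assumption, and `bilinear_sample` and the toy feature map are hypothetical.

```python
def bilinear_sample(feature_map, u, v):
    """Read a feature vector at a fractional position.

    feature_map[row][col] is a feature vector; (u, v) is (column, row).
    Returns None when the position falls outside the feature map, i.e. the
    scene point is not visible in this camera.
    Bilinear interpolation is an assumption, not taken from the source.
    """
    h, w = len(feature_map), len(feature_map[0])
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return None
    c0, r0 = int(u), int(v)
    c1, r1 = min(c0 + 1, w - 1), min(r0 + 1, h - 1)
    du, dv = u - c0, v - r0

    def mix(a, b, t):
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

    top = mix(feature_map[r0][c0], feature_map[r0][c1], du)
    bottom = mix(feature_map[r1][c0], feature_map[r1][c1], du)
    return mix(top, bottom, dv)

# Toy 2x2 feature map with 1-dimensional features (hypothetical values).
fmap = [[[0.0], [2.0]],
        [[4.0], [6.0]]]
scene_point_feature = bilinear_sample(fmap, 0.5, 0.5)  # [3.0]
```

When multi-scale feature maps are used, the same lookup would be repeated per scale after rescaling (u, v) to that map's resolution.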
[0166] In one possible implementation, in step S5024, based on the features corresponding to the target position in the image feature set, the step of determining the three-dimensional coordinates corresponding to the at least one scene point and the subsequent operations can be repeatedly executed according to an attention mechanism until a preset number of loops is reached; based on the features corresponding to the target position when the preset number of loops is reached, the features corresponding to the at least one scene point are obtained.
[0167] The preset number of loops can be set according to actual needs and is not limited. It can be understood that each loop can achieve one reverse tracing.
[0168] For example, when the preset number of iterations is reached, the features corresponding to the target location can be filled into the at least one scene point, thereby obtaining the features corresponding to the at least one scene point. Performing the above operation for each preset scene point yields the features corresponding to each preset scene point, thus obtaining a complete 3D scene representation.
[0169] For example, for preset scene points on the same ray, based on the features corresponding to the target positions of each preset scene point on the ray in the image feature set, the three-dimensional coordinates and subsequent operations corresponding to each preset scene point on the ray can be repeatedly executed based on the attention mechanism until a preset number of loops is reached; based on the features corresponding to the target positions of each preset scene point on the ray when the preset number of loops is reached, the features corresponding to each preset scene point on the ray are obtained. The implementation of the attention mechanism can be referred to the above formulas (3)-(5), and will not be repeated here. In this way, an adaptive attention mechanism is executed once for each preset scene point on each ray, that is, the epipolar feature optimization operation under the entire polar coordinate system definition of the 3D scene is completed; by using the adaptive attention mechanism to constrain the preset scene points on the same ray, it is helpful to obtain a more accurate 3D scene expression.
[0170] In this embodiment, the one-to-one projection relationship determined by the predefined preset scene points and the calibration information of the image acquisition device is used to accurately project the preset scene points onto the specific positions of the two-dimensional image. At the same time, based on the adaptive attention mechanism of polar coordinates, after multiple layers of iterative encoding (i.e., after a preset number of loops), the 2D semantic information corresponding to the preset scene points is accurately obtained. The 2D semantic information on the two-dimensional images acquired by the image acquisition devices located in different coordinate systems is filled into the preset scene points, thereby realizing the transformation of 2D images in different coordinate systems into a unified, accurate, and dense 3D scene expression under the BEV perspective.
[0171] For example, Figure 10 shows a schematic diagram of an image processing procedure according to an embodiment of this application. As shown in Figure 10, multiple preset scene points are predefined, centered on the vehicle and distributed in polar coordinates, to achieve unified dense modeling of the 3D scene near the vehicle. Image feature extraction is performed through the backbone network, extracting 2D image features from multiple image acquisition devices. Simultaneously, the decoding layer learns the feature description of the preset scene points based on a deformable attention mechanism. For preset scene points on the same ray, based on an adaptive polar coordinate attention mechanism and a multi-view adaptive attention mechanism, a feed-forward neural network (FFN) is used to fill the preset scene points with the extracted 2D image features corresponding to different image acquisition devices, thus completing one reverse tracing. After the decoding layer repeats the above reverse tracing six times, it transforms the 3D scene expression defined in polar coordinates into a 3D scene expression defined in Cartesian coordinates through sampling. Then, using the BEV encoder, the 3D scene expression from the BEV perspective is obtained. For different downstream autonomous driving tasks, the obtained 3D scene expression is input into different task heads, such as a 3D object detection task head or a BEV semantic segmentation task head, to execute the corresponding downstream autonomous driving task.
[0172] The training process of the above neural network model is illustrated below.
[0173] Figure 11 shows a flowchart of an image processing method according to an embodiment of this application. This method can be executed by the aforementioned image processing apparatus. As shown in Figure 11, the method may include the following steps:
[0174] S1101. Obtain training data corresponding to the target task; the training data includes two-dimensional sample images acquired by at least one image acquisition device of the vehicle.
[0175] For example, the training data may be 2D images captured by multiple onboard cameras mounted on the vehicle at different locations and with different coordinate systems. The training data may also be two-dimensional sample images obtained from an existing database, or two-dimensional sample images that can be received from other devices; for example, it may be two-dimensional sample images from the autonomous driving dataset nuScenes.
[0176] S1102. Using the training data and the preset scene point set, train the preset model to obtain the neural network model.
[0177] The preset scene point set can be referred to in the previous description, and will not be repeated here.
[0178] In this way, by learning the 3D scene representation of the vehicle through predefined scene points distributed in polar coordinates, the semantic information obtained by the trained neural network model is more accurate, and an accurate 3D scene representation can be learned without the need for a depth prediction network. In addition, the trained neural network model can transform multiple 2D images into a unified, accurate, and dense 3D scene representation from the BEV perspective. This solves the error and sparsity problems in 3D scene representation that may be caused by depth estimation and implicit projection methods, and the generated 3D scene representation can be used simultaneously for multiple downstream autonomous driving tasks such as 3D object detection and BEV semantic segmentation.
[0179] In one possible implementation, step S1102, which involves training a preset model using the training data and the preset scene point set to obtain the neural network model, may include: extracting training features from the two-dimensional sample image using the preset model, and determining training features corresponding to the at least one scene point from the extracted training features; executing the target task based on the training features corresponding to the at least one scene point, and adjusting the parameters of the preset model based on the execution result until a preset training termination condition is met.
[0180] In this way, training features of two-dimensional sample images are extracted through a preset model, and training features corresponding to at least one scene point are determined from the extracted training features, thereby realizing the reverse acquisition of features of 2D images corresponding to at least one scene point.
[0181] As an example, training features of the two-dimensional sample image can be extracted using the backbone network in a preset model; training features corresponding to the at least one scene point can be determined from the extracted training features using the decoding layer in the preset model. For example, training features corresponding to the at least one scene point can be determined from the extracted training features based on the projection relationship defined by the intrinsic and extrinsic parameter matrices of the image acquisition device.
[0182] For example, the loss function value can be obtained by comparing the execution result with the expected result corresponding to the target task. The parameters of the preset model can then be updated through backpropagation of the loss function value. The model with updated parameters can then be trained using the next batch of training samples (i.e., steps S1101 to S1102 are re-executed) until a preset training termination condition is met (e.g., the loss function converges, a preset number of iterations is reached, etc.), resulting in a trained neural network model. For instance, if the target task is vehicle recognition, the training data can include multiple sample images that contain vehicles and are collected by multiple image acquisition devices of the vehicle, and the vehicles in these sample images can be pre-labeled. These sample images are then input into the preset model, which can extract training features from the sample images and determine the training features corresponding to each preset scene point. Vehicle recognition is then performed based on the training features corresponding to each preset scene point to obtain the vehicle recognition result. This vehicle recognition result is compared with the pre-labeled vehicles to determine the loss function value. It is understood that a higher loss function value indicates a greater difference between the execution result obtained by the preset model and the actual result, and vice versa. Therefore, the parameter values in the preset model can be adjusted through backpropagation of the loss function value, and the above operation is repeated until the preset training termination condition is reached.
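The update-until-termination loop described above can be sketched on a toy problem. This is only an illustration of the control flow (compute loss, compute gradient, update parameters, check the termination condition); the one-parameter model y = w * x, the learning rate, and the tolerance are hypothetical stand-ins for the preset model and its training configuration.

```python
def train(samples, lr=0.1, max_iters=1000, tol=1e-8):
    """Fit y = w * x by gradient descent on the mean squared error.

    Stops when the change in loss falls below tol (convergence) or when
    max_iters is reached (preset number of iterations).
    Returns the learned weight, the last loss, and the number of steps run.
    """
    w = 0.0
    prev_loss = float("inf")
    loss = prev_loss
    step = 0
    for step in range(1, max_iters + 1):
        # Compare the model output with the expected (labeled) result.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if abs(prev_loss - loss) < tol:
            break
        # Gradient of the loss with respect to w, then a parameter update.
        grad = sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
        prev_loss = loss
    return w, loss, step

# Toy labeled data generated by y = 2x (hypothetical).
weight, final_loss, steps = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

In the embodiment the same loop would run over batches of two-dimensional sample images, with the loss supplied by the task head rather than by a squared error on a scalar.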
[0183] In one possible implementation, the step of extracting training features from the two-dimensional sample image using the preset model and determining training features corresponding to the at least one scene point from the extracted training features includes: obtaining scene points in the preset scene point set that are located on the same ray as the at least one scene point; extracting training features from the two-dimensional sample image using the preset model, and determining training features corresponding to each scene point from the extracted training features based on an attention mechanism.
[0184] For example, the attention mechanism can be an adaptive polar coordinate attention mechanism. Specifically, the adaptive attention mechanism can be applied to each scene point on each ray in the polar coordinate system, determining the training features corresponding to each scene point from the extracted training features. In this way, for preset scene points on the same ray, the attention mechanism, i.e., the adaptive polar coordinate attention mechanism, helps the preset model learn a more accurate 3D scene representation. The specific process of implementing the attention mechanism can be referred to the relevant descriptions above, and will not be repeated here.
[0185] Through the above steps S1101 to S1102, a trained neural network model is obtained, and then the following steps S1103 to S1105 can be executed to transform the 2D image into a 3D scene representation with a unified 3D coordinate system and BEV perspective, thereby enabling the execution of one or more downstream autonomous driving tasks.
[0186] S1103. Acquire a two-dimensional image acquired by the first image acquisition device; the first image acquisition device is any image acquisition device installed on the vehicle.
[0187] This step is the same as step S501 in Figure 5 above and will not be repeated here.
[0188] S1104. Extract features from the two-dimensional image using a neural network model, and determine the features corresponding to at least one scene point from the extracted features.
[0189] This step is the same as step S502 in Figure 5 above and will not be repeated here.
[0190] S1105. Execute the target task based on the features corresponding to the at least one scene point.
[0191] This step is the same as step S503 in Figure 5 above and will not be repeated here.
[0192] For example, Figure 12 shows a schematic diagram of the model training process according to an embodiment of this application. As shown in Figure 12, for two-dimensional sample images acquired by multiple image acquisition devices in different coordinate systems, training features of the two-dimensional sample images are extracted through the backbone network in the preset model to obtain training features of the two-dimensional sample images in different coordinate systems. Empty 3D scene points without semantic information are uniformly set around the vehicle in polar coordinates, centered on the vehicle. Using the projection relationship defined by the intrinsic and extrinsic parameter matrices of the image acquisition devices, the training features of the two-dimensional sample images corresponding to the empty 3D scene points are determined, and these training features are filled into the empty 3D scene points. Then, an adaptive attention mechanism is executed on the 3D scene points on the same ray in the polar coordinate system to complete epipolar feature optimization, helping the model learn more accurate 3D scene information. The operations of determining the training features of the two-dimensional sample images corresponding to the empty 3D scene points and performing epipolar feature optimization are repeated until the preset number of iterations in the decoder layer is reached. The decoder layer of the preset model outputs a 3D scene representation defined in polar coordinates. Then, through sampling, the 3D scene representation defined in polar coordinates is transformed into a 3D scene representation defined in Cartesian coordinates centered on the vehicle. For different autonomous driving tasks, the 3D scene representation defined in Cartesian coordinates can be input into different task heads, such as a 3D object detection head and a BEV semantic segmentation head, to execute the relevant tasks. Based on the task execution results, the parameters of the entire preset model are updated through gradient descent. The model with updated parameters is then iteratively trained using the next batch of training samples until the model reaches the preset number of iterations, thus completing the model training and obtaining a trained neural network model.
[0193] The performance of the image processing method provided in this application will be illustrated below using 3D object detection and BEV semantic segmentation tasks as examples.
[0194] As an example, taking the large-scale multi-instance autonomous driving dataset nuScenes as an example, the image processing method described in this application embodiment is applied to a 3D object detection task. The image processing method described in this application embodiment will be referred to as the Ego3RT model. The effectiveness of Ego3RT is evaluated on the nuScenes dataset, which is a large-scale autonomous driving dataset with 1000 driving scenes. Specifically, the nuScenes dataset provides image streams from six cameras at different vehicle orientations, the intrinsic and extrinsic parameter matrices for each camera, and complete multi-instance annotation information; the size of each image in the image stream is (1600, 900). The 1000 scenes, each approximately 20 seconds long, in the nuScenes dataset are split into 700 scenes for the training set, 150 scenes for the validation set, and 150 scenes for the test set.
[0195] For example, the mean Average Precision (mAP) and the nuScenes Detection Score (NDS) are used as evaluation metrics to assess the performance of 3D object detection on the nuScenes dataset. Higher values for both metrics are better. mAP is the average precision averaged across different distance thresholds (e.g., 0.5 m, 1 m, 2 m, 4 m) in the BEV viewpoint. NDS is a weighted average of mAP and the True Positive (TP) metrics, where TP is the set of five mean metrics: Average Translation Error (ATE), Average Velocity Error (AVE), Average Scale Error (ASE), Average Orientation Error (AOE), and Average Attribute Error (AAE). The formula for calculating NDS can be expressed as: $$\mathrm{NDS} = \frac{1}{10}\left[5\,\mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \left(1 - \min(1, \mathrm{mTP})\right)\right]$$ Where mTP represents any of the above average indices.
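As a worked illustration of the NDS definition used by the nuScenes detection benchmark, NDS = (1/10) * [5 * mAP + sum over the five mean TP errors of (1 - min(1, mTP))], the following sketch computes the score from mAP and the five mean error values. The function name `nds` and the input values are hypothetical and are not results from this application.

```python
def nds(m_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean TP errors.

    tp_errors: the five mean errors (mATE, mASE, mAOE, mAVE, mAAE).
    Each error is clipped at 1 and converted to a score 1 - min(1, error),
    so that lower errors give a higher NDS.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five mean TP errors")
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# Illustrative values only (not taken from Table 1).
example_score = nds(0.4, [0.5, 0.5, 0.5, 0.5, 0.5])  # 0.45
```

Because mAP carries a weight of 5 and each of the five TP scores a weight of 1, mAP accounts for half of the final score.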
[0196] In this embodiment, a Residual Network-101 (ResNet-101) equipped with deformable convolutions is used as the backbone network; the decoding layer utilizes multi-scale feature maps from stages 2, 3, and 4 of the backbone network as 2D semantic features. The predefined scene points have a resolution of 80x256 in polar coordinates, that is, 80 rays with 256 preset scene points on each ray; they are sampled and transformed into a Cartesian coordinate system with a resolution of 160x160 for subsequent downstream tasks. In this embodiment, the task head uses the CenterPoint detection head, which is widely used in 3D object detection.
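The polar-to-Cartesian sampling step can be sketched as follows. This is a minimal sketch under stated assumptions: the text only says the transformation is done "through sampling", so nearest-neighbour lookup, the angular binning convention, the `max_range` parameter, and the function name `polar_to_cartesian_grid` are all hypothetical.

```python
import math

def polar_to_cartesian_grid(polar, max_range, size):
    """Resample a polar grid onto a square grid centred on the vehicle.

    polar[t][r]: value stored for ray t (angular bin starting at 2*pi*t/T)
    and radial bin r. Returns a size x size grid; cells whose centre lies
    beyond max_range are set to None.
    Nearest-neighbour lookup is an assumption, not taken from the source.
    """
    T, R = len(polar), len(polar[0])
    cell = 2.0 * max_range / size
    grid = []
    for i in range(size):
        y = max_range - (i + 0.5) * cell       # cell-centre y, top row first
        row = []
        for j in range(size):
            x = -max_range + (j + 0.5) * cell  # cell-centre x
            rho = math.hypot(x, y)
            if rho >= max_range:
                row.append(None)
                continue
            theta = math.atan2(y, x) % (2.0 * math.pi)
            t = int(theta / (2.0 * math.pi) * T) % T
            r = min(int(rho / max_range * R), R - 1)
            row.append(polar[t][r])
        grid.append(row)
    return grid

# Toy usage: 4 rays x 2 radial bins, each cell labeled with its (ray, bin) index.
toy_polar = [[(t, r) for r in range(2)] for t in range(4)]
toy_grid = polar_to_cartesian_grid(toy_polar, max_range=1.0, size=2)
```

For the configuration described above, `polar` would hold 80 rays of 256 feature vectors each and `size` would be 160; the value of `max_range` is not stated in this section.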
[0197] Ego3RT was trained using the training method described in the previous examples on the nuScenes training set. The trained Ego3RT was then used to perform 3D object detection on the nuScenes test set. The object detection results were then compared with those of existing detection models such as MonoDIS, CenterNet, FCOS3D, PGD, DD3D, and DETR3D on the nuScenes test set. Table 1 shows the evaluation metrics for different detection models performing 3D object detection on the nuScenes test set. A marked entry in Table 1 indicates that a backbone network pre-trained on DD3D is used.
[0198] Table 1 - Evaluation metrics of different detection methods for 3D object detection on the nuScenes test set
[0199]
[0200] As shown in Table 1 above, Ego3RT trained using the method in this application embodiment has successfully achieved the best pure visual 3D object detection effect. This demonstrates that the image processing method described in this application embodiment can better utilize images acquired by multiple image acquisition devices to generate dense 3D scene representations that meet the requirements of downstream tasks.
[0201] The following section compares the effects of the existing Cartesian coordinate detection method with the image processing method in this application embodiment that defines preset scene points in polar coordinates, and the image processing method in this application embodiment that defines preset scene points and attention mechanism in polar coordinates. Table 2 shows the comparison of ablation experiment results in this embodiment.
[0202] Table 2 - Comparison of Ablation Experiment Results
[0203]
[0204] As shown in Table 2 above, the image processing method that defines the preset scene points near the vehicle using polar coordinates significantly improves the corresponding 3D object detection metrics (mATE, mAAE, and NDS) compared to the method using Cartesian coordinates. Furthermore, the image processing method that incorporates an adaptive attention mechanism based on the polar coordinate definition of preset scene points, by constraining preset scene points on the same ray, shows significant improvements in all 3D object detection metrics except for mATE. The significant improvements in the main metrics mAP and NDS demonstrate the effectiveness of the image processing method using polar coordinates to define preset scene points, and the image processing method using polar coordinates and incorporating an attention mechanism, as described in this application.
[0205] As another example, taking the large-scale multi-instance autonomous driving dataset nuScenes as an example, the image processing method described in this application embodiment is applied to the BEV semantic segmentation task. The dataset nuScenes, backbone network, decoder, and polar coordinate preset scene points used in this application embodiment are the same as those in the above 3D object detection task example, and will not be repeated here. In this application embodiment, the task head uses the BEV semantic segmentation head; wherein, the BEV semantic segmentation head module uses standard deconvolution for upsampling, upsampling the 3D scene representation in the Cartesian coordinate system with a resolution of 160x160 to a resolution of 480x480 to obtain more details. In this application embodiment, the Intersection over Union (IoU) or (class) average Intersection over Union (mIoU) is used to evaluate the performance of the BEV semantic segmentation head, where a higher IoU is better.
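The IoU and mIoU metrics mentioned above can be computed as in the following sketch. The function names (`iou`, `mean_iou`) and the toy masks are hypothetical; the masks stand for per-class binary BEV segmentation maps.

```python
def iou(pred, target):
    """Intersection over Union of two equally sized binary masks.

    pred, target: 2D lists of 0/1 values (predicted and ground-truth masks
    for one class). Returns NaN when both masks are empty.
    """
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else float("nan")

def mean_iou(per_class_ious):
    """Class-averaged IoU (mIoU) over a list of per-class IoU values."""
    return sum(per_class_ious) / len(per_class_ious)

# Toy 2x2 masks (hypothetical): one overlapping cell, three cells in the union.
pred_mask = [[1, 1],
             [0, 0]]
true_mask = [[1, 0],
             [1, 0]]
example_iou = iou(pred_mask, true_mask)
```

Here `example_iou` is 1/3, since one cell is shared and three cells are covered by either mask.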
[0206] The Ego3RT was trained using the training method described in the previous embodiment on the nuScenes training set. The trained Ego3RT was then used to perform BEV semantic segmentation on the nuScenes test set. The BEV semantic segmentation results were then compared with those of existing semantic segmentation models such as VED, VPN, PON, OFT, LSF, Image2Map, and LSS on the nuScenes test set. Table 3 shows the evaluation index values of different semantic segmentation models for BEV semantic segmentation on the nuScenes test set. The evaluation index values include the intersection-over-union ratio (IoU) for drivable areas, intersections, walkways, carparks, and dividers.
[0207] Table 3 - Evaluation metrics of different semantic segmentation models for BEV semantic segmentation on the nuScenes test set
[0208]
Model     | Multi-class | Drivable area | Traffic intersection | Pedestrian crossing | Parking lot | Divider
VED       | no          | 54.7          | 12.0                 | 20.7                | 13.5        | -
VPN       | no          | 58.0          | 27.3                 | 29.4                | 12.3        | -
PON       | no          | 60.4          | 28.0                 | 31.0                | 18.4        | -
OFT       | no          | 62.4          | 30.9                 | 34.5                | 23.5        | -
LSF       | no          | 61.1          | 33.5                 | 37.8                | 25.4        | -
Image2Map | no          | 74.5          | 36.6                 | 35.9                | 31.3        | -
OFT       | yes         | 71.7          | -                    | -                   | -           | 18.0
LSS       | yes         | 72.9          | -                    | -                   | -           | 20.0
Ego3RT    | yes         | 79.6          | 48.3                 | 52.0                | 50.3        | 47.5
[0209] As shown in Table 3 above, the Ego3RT trained using the method described in this application embodiment achieves state-of-the-art results in downstream BEV semantic segmentation tasks related to autonomous driving. This demonstrates the effectiveness, versatility, and scalability of the image processing method described in this application embodiment. Compared to existing models with weaknesses such as poor performance or limitation to 3D object detection, the Ego3RT model in this application embodiment can be applied to multiple downstream tasks simultaneously, achieving state-of-the-art performance in all downstream tasks, exhibiting better scalability and stronger versatility.
[0210] Thus, through experiments on the large-scale autonomous driving dataset nuScenes, the Ego3RT model in this application embodiment has achieved state-of-the-art results on multiple downstream autonomous driving benchmark tasks, demonstrating powerful capabilities and the effectiveness and importance of generating dense and general 3D scene representations, effectively improving the performance of the perception module in the autonomous driving system.
[0211] Based on the same inventive concept as the above method embodiments, embodiments of this application also provide an image processing apparatus, which can be used to execute the technical solutions described in the above method embodiments. For example, it can execute the steps of the image processing method shown in Figure 5, Figure 9, or Figure 11.
[0212] Figure 13 shows a block diagram of an image processing apparatus according to an embodiment of this application. As shown in Figure 13, the apparatus may include: an acquisition module 1301, used to acquire a two-dimensional image acquired by a first image acquisition device, where the first image acquisition device is any image acquisition device installed on the vehicle; a feature determination module 1302, used to extract features from the two-dimensional image through a neural network model and determine, among the extracted features, features corresponding to at least one scene point, where the scene point is a preset scene point in a preset scene point set under the BEV perspective, the preset scene point set is distributed in a polar coordinate system with the vehicle as the pole, the plane where the preset scene point set is located is parallel to the ground, and the neural network model is trained with training data corresponding to the target task; and an execution module 1303, used to execute the target task according to the features corresponding to the at least one scene point.
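The logical division into three modules can be pictured as a small pipeline. This is only a sketch under stated assumptions: the class `ImageProcessingApparatus`, its field names, and the toy stand-in functions are hypothetical and chosen for illustration; the application does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]   # hypothetical stand-in for a 2D camera image
Feature = List[float]       # hypothetical per-scene-point feature vector


@dataclass
class ImageProcessingApparatus:
    """Logical division into three modules; names are illustrative only."""
    acquire: Callable[[], Image]                           # role of acquisition module 1301
    determine_features: Callable[[Image], List[Feature]]   # role of feature determination module 1302
    execute: Callable[[List[Feature]], int]                # role of execution module 1303

    def run(self) -> int:
        image = self.acquire()
        scene_features = self.determine_features(image)
        return self.execute(scene_features)


# Toy stand-ins so the sketch runs; a real system would plug in a camera
# driver, the trained neural network model, and a task head.
apparatus = ImageProcessingApparatus(
    acquire=lambda: [[0.0, 1.0], [2.0, 3.0]],
    determine_features=lambda image: [[sum(row) / len(row)] for row in image],
    execute=lambda features: sum(1 for f in features if f[0] > 1.0),
)
result = apparatus.run()
```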
[0213] In this embodiment, the 3D scene near the vehicle from the BEV perspective is uniformly modeled in polar coordinates, which is more consistent with the pinhole camera model. Features of the 2D image are extracted through a neural network model, and features corresponding to at least one scene point are determined from the extracted 2D image features. This allows the reverse acquisition of the features of the 2D image required for the preset scene points distributed in polar coordinates. This transforms 2D images from different coordinate systems into a unified, accurate, and dense 3D scene representation from the BEV perspective, avoiding the error accumulation caused by depth estimation and the suboptimal results caused by the lack of geometric constraints in implicit projection. Furthermore, based on the features corresponding to at least one scene point, a target task can be executed. In some examples, there can be multiple target tasks, thus enabling the unified, accurate, and dense 3D scene representation to be applied to multiple target tasks simultaneously.
[0214] In one possible implementation, the at least one scene point includes preset scene points located on the same ray in the preset scene point set, the ray having the pole as its endpoint; the feature determination module 1302 is further configured to extract features of the two-dimensional image through the neural network model, and determine the features corresponding to the at least one scene point from the extracted features based on an attention mechanism.
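This section does not spell out the form of the attention mechanism. As one possible reading, the sketch below uses plain scaled dot-product attention, which is a swapped-in, generic technique rather than the specific mechanism of this application. Each scene point on a ray is given a query vector, and the image features act as both keys and values. All names (`attend`, `ray_queries`, `image_features`) and all numbers are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for a single query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Hypothetical queries for three scene points on one ray, ordered near to far.
ray_queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Hypothetical image features, used here as both keys and values.
image_features = [[4.0, 0.0], [0.0, 4.0]]
ray_features = [attend(q, image_features, image_features) for q in ray_queries]
```

In this toy setup the middle scene point weights both image features equally, while the first and last scene points each lean toward the feature that matches their query.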
[0215] In one possible implementation, the feature determination module 1302 is further configured to: extract features from the two-dimensional image using the neural network model to obtain an image feature set; wherein the image feature set includes features corresponding to multiple locations on the two-dimensional image; determine the three-dimensional coordinates corresponding to the at least one scene point using the neural network model; map the three-dimensional coordinates to the coordinate system of the image acquisition device according to the three-dimensional coordinates and the calibration information of the first image acquisition device to determine the target position corresponding to the three-dimensional coordinates among the multiple locations; and obtain the features corresponding to the at least one scene point according to the features corresponding to the target position in the image feature set.
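The mapping from a three-dimensional coordinate to a target position on the image can be sketched with a standard pinhole projection. The sketch assumes that the calibration information consists of a rotation, a translation, and the intrinsics `fx`, `fy`, `cx`, `cy`, and that image features are stored on a grid downsampled by a fixed `stride`; these representations, the nearest-cell lookup, and the function names are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
FeatureMap = List[List[List[float]]]  # hypothetical layout: [row][col] -> feature vector


def project_to_image(point: Vec3, rotation: Mat3, translation: Vec3,
                     fx: float, fy: float, cx: float, cy: float
                     ) -> Optional[Tuple[float, float]]:
    """Project a 3D point into pixel coordinates; None if behind the camera."""
    # Extrinsics: move the point into the camera coordinate system.
    x, y, z = (sum(r * p for r, p in zip(row, point)) + t
               for row, t in zip(rotation, translation))
    if z <= 0.0:
        return None
    # Intrinsics: perspective division followed by focal length and offset.
    return fx * x / z + cx, fy * y / z + cy


def lookup_feature(feature_map: FeatureMap, pixel: Optional[Tuple[float, float]],
                   stride: int) -> Optional[List[float]]:
    """Return the feature of the grid cell containing the pixel, if any."""
    if pixel is None:
        return None
    col, row = int(pixel[0] // stride), int(pixel[1] // stride)
    if 0 <= row < len(feature_map) and 0 <= col < len(feature_map[0]):
        return feature_map[row][col]
    return None


identity: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
# Toy feature map whose feature at (row, col) is simply [row, col].
feature_map = [[[float(r), float(c)] for c in range(8)] for r in range(6)]
pixel = project_to_image((1.0, 0.5, 2.0), identity, (0.0, 0.0, 0.0),
                         fx=100.0, fy=100.0, cx=64.0, cy=48.0)
feature = lookup_feature(feature_map, pixel, stride=16)
```

The identity extrinsics are used only to keep the arithmetic easy to follow; in a real vehicle the camera axes generally differ from the vehicle axes, and interpolation between neighboring cells may be preferred over the nearest-cell lookup shown here.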
[0216] In one possible implementation, the feature determination module 1302 is further configured to: based on the features corresponding to the target position in the image feature set, repeatedly execute the determination of the three-dimensional coordinates corresponding to the at least one scene point and subsequent operations based on an attention mechanism until a preset number of loops is reached; and obtain the features corresponding to the at least one scene point based on the features corresponding to the target position when the preset number of loops is reached.
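The repeated execution described above can be read as a fixed-length refinement loop. The sketch below shows only the control flow: the three callables (`predict_coords`, `sample_feature`, `update_query`) are hypothetical placeholders for the coordinate prediction, the feature lookup at the target position, and the attention-based update, and the toy lambdas exist only so that the example runs.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Vec3 = Tuple[float, float, float]


def refine_scene_point(
    query: Vector,
    predict_coords: Callable[[Vector], Vec3],
    sample_feature: Callable[[Vec3], Vector],
    update_query: Callable[[Vector, Vector], Vector],
    num_loops: int,
) -> Vector:
    """Repeat coordinate prediction and feature lookup a preset number of times."""
    if num_loops < 1:
        raise ValueError("num_loops must be at least 1")
    feature: Vector = []
    for _ in range(num_loops):
        coords = predict_coords(query)         # 3D coordinates for the scene point
        feature = sample_feature(coords)       # feature at the mapped target position
        query = update_query(query, feature)   # stand-in for the attention-based update
    # The feature from the final loop is taken as the scene point's feature.
    return feature


# Toy stand-ins, purely illustrative.
final_feature = refine_scene_point(
    query=[2.0, 4.0],
    predict_coords=lambda q: (q[0], q[1], 0.0),
    sample_feature=lambda c: [c[0] * 0.5, c[1] * 0.5],
    update_query=lambda q, f: [a + b for a, b in zip(q, f)],
    num_loops=3,
)
```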
[0217] In one possible implementation, the preset scene points in the preset scene point set are evenly distributed in the polar coordinate system.
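One way to read "evenly distributed" is even spacing in both azimuth and radius. The following minimal sketch generates such a preset scene point set under that assumption; the function name and the parameters `num_rays`, `points_per_ray`, and `max_radius` are hypothetical and not taken from this application.

```python
import math
from typing import List, Tuple

PolarPoint = Tuple[float, float]  # (radius, azimuth in radians), pole at the vehicle


def polar_scene_points(num_rays: int, points_per_ray: int,
                       max_radius: float) -> List[PolarPoint]:
    """Scene points evenly spaced in azimuth and in radius along each ray."""
    points: List[PolarPoint] = []
    for ray in range(num_rays):
        azimuth = 2.0 * math.pi * ray / num_rays
        for step in range(1, points_per_ray + 1):
            radius = max_radius * step / points_per_ray
            points.append((radius, azimuth))
    return points


grid = polar_scene_points(num_rays=8, points_per_ray=4, max_radius=40.0)
```

With these example values the set contains 32 points, grouped into 8 rays of 4 points each, all lying in a single plane parallel to the ground.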
[0218] In one possible implementation, the device further includes: a training module for acquiring training data corresponding to the target task; the training data includes two-dimensional sample images acquired by at least one image acquisition device of the vehicle; the training module is further configured to use the training data and the preset scene point set to train a preset model to obtain the neural network model.
[0219] In one possible implementation, the training module is further configured to: extract training features of the two-dimensional sample image through the preset model, and determine training features corresponding to the at least one scene point from the extracted training features; execute the target task according to the training features corresponding to the at least one scene point, and adjust the parameters of the preset model according to the execution result until the preset training termination condition is reached.
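The adjust-until-termination pattern can be illustrated with a deliberately tiny example. The sketch below fits a single scalar weight by gradient descent and stops when either a loss threshold or a step budget is reached; the one-parameter model, the squared-error loss, and both termination conditions are assumptions chosen for brevity and do not represent the preset model or the loss used in this application.

```python
from typing import List, Tuple


def train(samples: List[Tuple[float, float]], weight: float = 0.0,
          learning_rate: float = 0.1, max_steps: int = 1000,
          loss_threshold: float = 1e-8) -> Tuple[float, float]:
    """Adjust one parameter until a preset termination condition is met."""
    loss = float("inf")
    for _ in range(max_steps):                      # termination condition 1: step budget
        # "Execute the task": predict y from x and measure the squared error.
        loss = sum((weight * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < loss_threshold:                   # termination condition 2: small loss
            break
        # Adjust the parameter according to the execution result.
        gradient = sum(2.0 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= learning_rate * gradient
    return weight, loss


trained_weight, final_loss = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```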
[0220] In one possible implementation, the training module is further configured to: acquire each scene point in the preset scene point set that is located on the same ray as the at least one scene point; extract training features of the two-dimensional sample image through the preset model, and determine the training features corresponding to each scene point from the extracted training features based on an attention mechanism.
[0221] In one possible implementation, the execution module 1303 is further configured to: convert the at least one scene point to a Cartesian coordinate system to obtain the coordinates of the at least one scene point in the Cartesian coordinate system; and execute the target task based on the features corresponding to the at least one scene point and the coordinates of the at least one scene point in the Cartesian coordinate system.
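The coordinate conversion is the standard polar-to-Cartesian transform with the vehicle at the origin. The sketch below applies it to scene points and keeps each feature paired with its Cartesian coordinates; the axis convention (azimuth measured from the x-axis) and the data layout are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def polar_to_cartesian(radius: float, azimuth: float) -> Tuple[float, float]:
    """Convert (radius, azimuth in radians) to (x, y) with the pole at the origin."""
    return radius * math.cos(azimuth), radius * math.sin(azimuth)


# Hypothetical scene points with one feature vector each.
scene_points = [(2.0, 0.0), (1.0, math.pi / 2)]
features = [[0.1, 0.9], [0.7, 0.3]]

# Pair every feature with the Cartesian coordinates of its scene point,
# which is the form a downstream task head could consume.
cartesian_features = [
    (polar_to_cartesian(radius, azimuth), feature)
    for (radius, azimuth), feature in zip(scene_points, features)
]
```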
[0222] In one possible implementation, the target task includes one or more of image classification, semantic segmentation, or object detection.
[0223] For the technical effects and specific descriptions of the image processing apparatus shown in Figure 13 and its various possible implementations, refer to the image processing method described above; they are not repeated here.
[0224] It should be understood that the division of modules in the above device is only a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, the modules in the device can be implemented by a processor calling software; for example, the device includes a processor connected to a memory containing instructions. The processor calls the instructions stored in the memory to implement any of the above methods or to implement the functions of each module in the device. The processor can be, for example, a general-purpose processor, such as a Central Processing Unit (CPU) or a microprocessor, and the memory can be internal or external to the device. Alternatively, the modules in the device can be implemented as hardware circuits. The functionality of some or all modules can be achieved through the design of these hardware circuits, which can be understood as one or more processors. For example, in one implementation, the hardware circuit is an application-specific integrated circuit (ASIC). The functionality of some or all of the modules is achieved through the design of the logical relationships between the components within the circuit. In another implementation, the hardware circuit can be implemented using a programmable logic device (PLD). Taking a field-programmable gate array (FPGA) as an example, it can include a large number of logic gates. The connection relationships between these logic gates are configured through configuration files, thereby achieving the functionality of some or all of the modules. All modules of the above device can be implemented entirely through processor-invoked software, entirely through hardware circuits, or partially through processor-invoked software with the remaining parts implemented through hardware circuits.
[0225] In this application embodiment, a processor is a circuit with signal processing capabilities. In one implementation, the processor can be a circuit capable of reading and executing instructions, such as a CPU, microprocessor, graphics processing unit (GPU), digital signal processor (DSP), neural network processing unit (NPU), or tensor processing unit (TPU). In another implementation, the processor can implement certain functions through the logical relationships of hardware circuits, where these logical relationships are either fixed or reconfigurable. For example, the processor is a hardware circuit implemented by an ASIC or a PLD, such as an FPGA. In a reconfigurable hardware circuit, the process of the processor loading a configuration file and configuring the hardware circuit can be understood as the process of the processor loading instructions to implement the functions of some or all of the above modules.
[0226] As can be seen, each module in the above apparatus can be one or more processors (or processing circuits) configured to implement the methods of the above embodiments, such as: CPU, GPU, NPU, TPU, microprocessor, DSP, ASIC, FPGA, or a combination of at least two of these processor types. Furthermore, each module in the above apparatus can be integrated in whole or in part, or can be implemented independently; there is no limitation on this.
[0227] Embodiments of this application also provide an image processing apparatus, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the method of the above embodiments when executing the instructions. For example, the steps of the image processing method shown in Figure 5, Figure 9, or Figure 11 can be performed.
[0228] Figure 14 is a schematic structural diagram of an image processing apparatus according to an embodiment of this application. As shown in Figure 14, the image processing apparatus may include at least one processor 701, a communication line 702, a memory 703, and at least one communication interface 704.
[0229] Processor 701 may be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit, or one or more integrated circuits for controlling the execution of programs according to the present application. Processor 701 may also include a heterogeneous computing architecture of multiple general-purpose processors, such as a combination of at least two of CPU, GPU, microprocessor, DSP, ASIC, and FPGA. As an example, processor 701 may be CPU+GPU, CPU+ASIC, or CPU+FPGA.
[0230] Communication line 702 may include a path for transmitting information between the aforementioned components.
[0231] The communication interface 704 uses any transceiver-like device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
[0232] The memory 703 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions; a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.); magnetic disk storage media or other magnetic storage devices; or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor via communication line 702. The memory may also be integrated with the processor. The memory provided in this application embodiment is generally non-volatile. The memory 703 is used to store computer-executable instructions for executing the scheme of this application, and their execution is controlled by the processor 701. The processor 701 is used to execute the computer-executable instructions stored in the memory 703, thereby implementing the method provided in the above embodiments of this application; for example, the steps of the image processing method shown in Figure 5, Figure 9, or Figure 11 can be implemented.
[0233] Optionally, the computer-executable instructions in the embodiments of this application may also be referred to as application code; the embodiments of this application do not specifically limit this.
[0234] For example, processor 701 may include one or more CPUs, such as CPU0 in Figure 14; processor 701 may also include a CPU together with any one of a GPU, an ASIC, or an FPGA, such as CPU0+GPU0, CPU0+ASIC0, or CPU0+FPGA0 in Figure 14.
[0235] For example, an image processing apparatus may include multiple processors, such as processors 701 and 707 in Figure 14. Each of these processors can be a single-core processor, a multi-core processor, or a heterogeneous computing architecture that includes multiple general-purpose processors. Here, "processor" can refer to one or more devices, circuits, and/or processing cores used to process data (e.g., computer program instructions).
[0236] In a specific implementation, as one embodiment, the image processing apparatus may further include an output device 705 and an input device 706. The output device 705 communicates with the processor 701 and can display information in various ways. For example, the output device 705 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector, such as a vehicle-mounted HUD, AR-HUD, or monitor. The input device 706 communicates with the processor 701 and can receive user input in various ways. For example, the input device 706 may be a mouse, keyboard, touchscreen device, or sensing device.
[0237] Embodiments of this application provide a computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the methods described in the above embodiments. For example, the steps of the image processing method shown in Figure 5, Figure 9, or Figure 11 can be implemented.
[0238] Embodiments of this application provide a computer program product, which may include, for example, computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer program product is run on a computer, the computer performs the methods described in the above embodiments. For example, the steps of the image processing method shown in Figure 4, Figure 7, or Figure 11 can be performed.
[0239] A computer-readable storage medium can be a tangible device capable of holding and storing instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination thereof. The computer-readable storage medium used herein is not to be construed as a transitory signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.
[0240] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.
[0241] The computer program instructions used to perform the operations of this application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, status setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), are personalized by utilizing the status information of the computer-readable program instructions. These electronic circuits can execute the computer-readable program instructions to implement various aspects of this application.
[0242] Various aspects of this application are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.
[0243] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0244] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more boxes of a flowchart and / or block diagram.
[0245] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0246] The various embodiments of this application have been described above. These descriptions are exemplary, not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical applications, or their technical improvements over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. An image processing method, characterized in that it comprises: acquiring a two-dimensional image captured by a first image acquisition device, wherein the first image acquisition device is any image acquisition device installed on a vehicle; extracting features of the two-dimensional image using a neural network model, and determining, from the extracted features, features corresponding to at least one scene point, wherein the scene point is a preset scene point in a preset scene point set under a bird's-eye view (BEV) perspective, the preset scene point set is distributed in a polar coordinate system with the vehicle as the pole, the plane in which the preset scene point set is located is parallel to the ground, and the neural network model is trained using training data corresponding to a target task; and executing the target task based on the features corresponding to the at least one scene point.
2. The method according to claim 1, characterized in that, The at least one scene point includes a preset scene point located on the same ray in the preset scene point set, and the ray has the pole as its endpoint; The step of extracting features from the two-dimensional image using a neural network model and determining features corresponding to at least one scene point from the extracted features includes: The neural network model extracts features from the two-dimensional image, and based on an attention mechanism, determines the features corresponding to the at least one scene point from the extracted features.
3. The method according to claim 1 or 2, characterized in that, The step of extracting features from the two-dimensional image using a neural network model and determining features corresponding to at least one scene point from the extracted features includes: The neural network model is used to extract features from the two-dimensional image to obtain an image feature set; wherein the image feature set includes features corresponding to multiple locations on the two-dimensional image; The neural network model is used to determine the three-dimensional coordinates corresponding to the at least one scene point; Based on the three-dimensional coordinates and the calibration information of the first image acquisition device, the three-dimensional coordinates are mapped to the coordinate system of the image acquisition device to determine the target position corresponding to the three-dimensional coordinates among the plurality of positions; Based on the features corresponding to the target location in the image feature set, the features corresponding to the at least one scene point are obtained.
4. The method according to claim 3, characterized in that, The step of obtaining the features corresponding to the at least one scene point based on the features corresponding to the target location in the image feature set includes: Based on the features corresponding to the target location in the image feature set, and based on the attention mechanism, the determination of the three-dimensional coordinates of at least one scene point and subsequent operations are repeatedly performed until a preset number of loops is reached. Based on the features corresponding to the target position when the preset number of cycles is reached, the features corresponding to the at least one scene point are obtained.
5. The method according to any one of claims 1-4, characterized in that, The preset scene points in the preset scene point set are evenly distributed in the polar coordinate system.
6. The method according to any one of claims 1-5, characterized in that, The method further includes: Acquire training data corresponding to the target task; the training data includes two-dimensional sample images acquired by at least one image acquisition device of the vehicle; The preset model is trained using the training data and the preset scene point set to obtain the neural network model.
7. The method according to claim 6, characterized in that, The step of training a preset model using the training data and the preset scene point set to obtain the neural network model includes: The training features of the two-dimensional sample image are extracted using the preset model, and the training features corresponding to the at least one scene point are determined from the extracted training features. Based on the training features corresponding to at least one scene point, the target task is executed, and the parameters of the preset model are adjusted according to the execution results until the preset training termination condition is met.
8. The method according to claim 7, characterized in that, The step of extracting training features from the two-dimensional sample image using the preset model, and determining the training features corresponding to the at least one scene point from the extracted training features, includes: Obtain each scene point in the preset scene point set that is located on the same ray as the at least one scene point; The training features of the two-dimensional sample images are extracted using the preset model, and based on the attention mechanism, the training features corresponding to each scene point are determined from the extracted training features.
9. The method according to any one of claims 1-8, characterized in that, The step of executing the target task based on the features corresponding to the at least one scene point includes: Transform the at least one scene point into a Cartesian coordinate system to obtain the coordinates of the at least one scene point in the Cartesian coordinate system; The target task is executed based on the features corresponding to the at least one scene point and the coordinates of the at least one scene point in the Cartesian coordinate system.
10. The method according to any one of claims 1-9, characterized in that, The target task includes one or more of the following: image classification, semantic segmentation, or object detection.
11. An image processing apparatus, characterized in that the apparatus comprises: an acquisition module, used to acquire a two-dimensional image captured by a first image acquisition device, wherein the first image acquisition device is any image acquisition device installed on a vehicle; a feature determination module, used to extract features of the two-dimensional image through a neural network model and to determine, from the extracted features, features corresponding to at least one scene point, wherein the scene point is a preset scene point in a preset scene point set under a bird's-eye view (BEV) perspective, the preset scene point set is distributed in a polar coordinate system with the vehicle as the pole, the plane in which the preset scene point set is located is parallel to the ground, and the neural network model is trained using training data corresponding to a target task; and an execution module, used to execute the target task based on the features corresponding to the at least one scene point.
12. The apparatus according to claim 11, characterized in that, The at least one scene point includes a preset scene point located on the same ray in the preset scene point set, and the ray has the pole as its endpoint; The feature determination module is further configured to extract features of the two-dimensional image through the neural network model, and, based on an attention mechanism, determine the features corresponding to the at least one scene point from the extracted features.
13. The apparatus according to claim 11 or 12, characterized in that, The feature determination module is further configured to: The neural network model is used to extract features from the two-dimensional image to obtain an image feature set; wherein the image feature set includes features corresponding to multiple locations on the two-dimensional image; The neural network model is used to determine the three-dimensional coordinates corresponding to the at least one scene point; Based on the three-dimensional coordinates and the calibration information of the first image acquisition device, the three-dimensional coordinates are mapped to the coordinate system of the image acquisition device to determine the target position corresponding to the three-dimensional coordinates among the plurality of positions; Based on the features corresponding to the target location in the image feature set, the features corresponding to the at least one scene point are obtained.
14. The apparatus according to claim 13, characterized in that, The feature determination module is further configured to: Based on the features corresponding to the target location in the image feature set, and based on the attention mechanism, the determination of the three-dimensional coordinates of at least one scene point and subsequent operations are repeatedly performed until a preset number of loops is reached. Based on the features corresponding to the target position when the preset number of cycles is reached, the features corresponding to the at least one scene point are obtained.
15. The apparatus according to any one of claims 11-14, characterized in that the preset scene points in the preset scene point set are uniformly distributed in the polar coordinate system.
16. The apparatus according to any one of claims 11-15, characterized in that the apparatus further includes a training module configured to acquire training data corresponding to the target task, the training data including two-dimensional sample images acquired by at least one image acquisition device of the vehicle; and the training module is further configured to train a preset model using the training data and the preset scene point set to obtain the neural network model.
17. The apparatus according to claim 16, characterized in that the training module is further configured to: extract training features of the two-dimensional sample images through the preset model, and determine the training features corresponding to the at least one scene point from the extracted training features; and execute the target task based on the training features corresponding to the at least one scene point, and adjust the parameters of the preset model according to the execution result until a preset training termination condition is met.
18. The apparatus according to claim 17, characterized in that the training module is further configured to: obtain each scene point in the preset scene point set that is located on the same ray as the at least one scene point; and extract training features of the two-dimensional sample images through the preset model and, based on the attention mechanism, determine the training features corresponding to each such scene point from the extracted training features.
19. The apparatus according to any one of claims 11-18, characterized in that the execution module is further configured to: transform the at least one scene point into a Cartesian coordinate system to obtain the coordinates of the at least one scene point in the Cartesian coordinate system; and execute the target task based on the features corresponding to the at least one scene point and the coordinates of the at least one scene point in the Cartesian coordinate system.
20. The apparatus according to any one of claims 11-19, characterized in that the target task includes one or more of the following: image classification, semantic segmentation, or object detection.
21. An image processing apparatus, characterized by comprising: a processor; and a memory configured to store processor-executable instructions; wherein the processor is configured to implement the method according to any one of claims 1-10 when executing the instructions.
22. A computer-readable storage medium having computer program instructions stored thereon, characterized in that, when executed by a processor, the computer program instructions implement the method according to any one of claims 1-10.
23. A computer program product, characterized in that, when the computer program product is run on a computer, it causes the computer to perform the method according to any one of claims 1-10.
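For illustration only, and not as part of the claims, the geometric steps recited in claims 13, 15, and 19 might be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the uniform spacing in both angle and radius, the 2D ground-plane polar grid, and the pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` and a rotation/translation extrinsic are all hypothetical choices that the claims do not specify. The neural network model and the attention mechanism are not modelled.

```python
import math


def uniform_polar_grid(num_rays, num_radii, max_radius):
    """Build (radius, angle) scene points spaced uniformly around the pole.

    Assumption: "uniformly distributed" is read here as equal steps in both
    angle and radius; the claims do not fix a particular spacing.
    """
    points = []
    for i in range(num_rays):
        angle = 2.0 * math.pi * i / num_rays
        for j in range(1, num_radii + 1):
            points.append((max_radius * j / num_radii, angle))
    return points


def points_on_ray(points, angle, tol=1e-9):
    """Select the scene points lying on the ray from the pole at `angle`."""
    return [p for p in points if abs(p[1] - angle) <= tol]


def polar_to_cartesian(radius, angle):
    """Convert a polar scene point to Cartesian (x, y) coordinates."""
    return (radius * math.cos(angle), radius * math.sin(angle))


def project_to_image(point_xyz, rotation, translation, fx, fy, cx, cy):
    """Map a 3D point in the vehicle frame to pixel coordinates.

    Assumption: calibration is a 3x3 rotation (list of rows), a translation
    vector, and pinhole intrinsics. Returns None for points at or behind
    the camera plane.
    """
    cam = [
        sum(rotation[r][c] * point_xyz[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if cam[2] <= 0.0:
        return None
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)
```

In such a sketch, the pixel position returned by `project_to_image` would play the role of the target position at which features are read from the image feature set, and `points_on_ray` would select the scene points that share a ray for the attention step.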