A target perception method and apparatus
By iteratively fusing dynamic and static target features, the problem of insufficient depth information in purely vision-based autonomous driving is solved, achieving more accurate target perception and location recognition, and improving the safety of autonomous driving.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: HUAWEI TECH CO LTD
- Filing Date: 2023-05-30
- Publication Date: 2026-06-12
Figure CN116883961B_ABST
Description
Technical Field
[0001] This application relates to the field of vehicles, and more particularly to a target perception method and apparatus.
Background Technology
[0002] Vision-based autonomous driving perception is gaining increasing attention. Compared to laser point clouds, images captured by cameras can provide 3D detection capabilities at greater distances, richer visual semantic information, and lower deployment costs. However, camera-captured images typically lack depth information, making vision-based 3D perception quite challenging. Improving the performance of vision-based 3D perception with an acceptable computational load, utilizing multi-task information, remains a major challenge in the field.
[0003] Taking autonomous driving as an example, the process typically requires 3D dynamic object detection and bird's-eye view (BEV) static road structure recognition. Usually, a backbone network is used to extract deep features from the input image, and these features are transformed into the BEV space for subsequent object detection or segmentation tasks. However, this can lead to insufficient feature extraction. Therefore, how to fully extract the information contained in the features becomes a pressing issue.
Summary of the Invention
[0004] This application provides a target perception method and apparatus for fully mining the features of each target in the case of sparse data, more accurately perceiving information about dynamic targets and information about static targets, and improving the accuracy of target perception.
[0005] In view of the above, in a first aspect, this application provides a target perception method, comprising: first, acquiring image features, which may include features extracted from an input image; then, iteratively acquiring dynamic target features and static target features represented in the input image based on the image features, wherein the objects in the input image include dynamic targets and static targets, the dynamic target features being the extracted features of the dynamic targets, and the static target features being the extracted features of the static targets, wherein the moving speed of the dynamic targets is greater than the moving speed of the static targets; subsequently, perceiving and acquiring information about the dynamic targets and information about the static targets based on the features of the dynamic targets and the features of the static targets.
[0006] In particular, any one of the aforementioned iteration processes may include: first, obtaining location information, which includes information representing the location of static targets and information representing the location of dynamic targets; then, fusing the dynamic target features and static target features obtained in the previous iteration based on the location information to obtain a fusion result; and then, based on the fusion result and the location information, performing feature sampling from the image features to obtain the dynamic target features and static target features of the current iteration.
[0007] Therefore, in the embodiments of this application, during the iterative process of target perception, dynamic target features and static target features can be fused to achieve contextual information fusion between dynamic and static targets, thereby obtaining information that can more accurately represent the relative positions of dynamic and static targets. The features of each target can be fully explored, and sampling can be performed from image features based on the fusion results, thereby more accurately collecting dynamic and static target features from image features, thus improving the iterative convergence efficiency, the accuracy of target perception, and the perception efficiency.
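The iteration described in the first aspect can be sketched as a simple loop. The following is a minimal sketch rather than the claimed implementation: the helper names (`decode_positions`, `fuse`, `sample`), the weight matrices in `params`, the mean-based fusion, and the nearest-neighbour sampling are all assumptions introduced for illustration. Only the order of the three steps (obtain location information, fuse dynamic and static features, sample from the image features) comes from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_positions(queries, w_pos):
    # Hypothetical decoder: one normalized (u, v) location in [0, 1] per query.
    return sigmoid(queries @ w_pos)                       # (N, 2)

def fuse(queries, pos, w_embed):
    # Placeholder fusion: embed the locations, then mix each query with the
    # mean over all dynamic and static queries to share context.
    x = queries + pos @ w_embed
    return 0.5 * x + 0.5 * x.mean(axis=0, keepdims=True)

def sample(image_feat, fused, pos, w_offset):
    # Nearest-neighbour sampling at the decoded location plus a small offset
    # predicted from the fusion result.
    h, w, _ = image_feat.shape
    uv = np.clip(pos + 0.05 * np.tanh(fused @ w_offset), 0.0, 1.0)
    cols = np.rint(uv[:, 0] * (w - 1)).astype(int)
    rows = np.rint(uv[:, 1] * (h - 1)).astype(int)
    return image_feat[rows, cols]                         # (N, C)

def iterate(image_feat, dyn_feat, sta_feat, params, num_iters=3):
    n_dyn = dyn_feat.shape[0]
    queries = np.concatenate([dyn_feat, sta_feat], axis=0)
    for _ in range(num_iters):
        pos = decode_positions(queries, params["w_pos"])      # 1. location information
        fused = fuse(queries, pos, params["w_embed"])         # 2. dynamic-static fusion
        sampled = sample(image_feat, fused, pos, params["w_offset"])
        queries = fused + sampled                             # 3. features of this iteration
    return queries[:n_dyn], queries[n_dyn:]

rng = np.random.default_rng(0)
C = 8
params = {
    "w_pos": rng.normal(size=(C, 2)),
    "w_embed": rng.normal(size=(2, C)),
    "w_offset": rng.normal(size=(C, 2)),
}
image_feat = rng.normal(size=(16, 32, C))                 # toy (H, W, C) image features
dyn_out, sta_out = iterate(image_feat, rng.normal(size=(5, C)), rng.normal(size=(7, C)), params)
```

Here five dynamic queries and seven static queries are refined over three iterations, and each query keeps its feature dimension `C` throughout.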
[0008] In one possible implementation, the aforementioned acquisition of location information may include: acquiring location information based on the dynamic and static target features output from the previous iteration. Therefore, in this embodiment, during each iteration, location information can be obtained by combining the features obtained from the previous iteration, thereby achieving layer-by-layer iteration.
[0009] In one possible implementation, the aforementioned location information may include the location of the dynamic target in 3D space and its location in image features, and the location of the static target in 3D space and its location in image features. The aforementioned acquisition of location information based on the dynamic target features and static target features output from the previous iteration may include: decoding the dynamic target features output from the previous iteration to obtain the location of the dynamic target in 3D space and its location in image features, where the 3D space can be understood as the space representing the scene where the dynamic and static targets are located, or that space scaled by a certain ratio; then adjusting, based on the static target features, the location of the static target in 3D space output from the previous iteration to obtain the location of the static target in 3D space for the current iteration; and obtaining the location of the static target in image features based on its location in 3D space for the current iteration.
[0010] In this embodiment of the application, after decoding to obtain the dynamic and static 3D spatial positions, the 3D spatial position of the static target can be refined, thereby making the obtained 3D spatial position of the static target more accurate.
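As an illustration of how a refined 3D position could be mapped to a position in the image features, the sketch below uses a standard pinhole camera projection. The text does not specify the mapping, so the camera intrinsics `K`, the extrinsics `R` and `t`, the feature-map `stride`, and the linear offset regression in `refine_static_positions` are assumptions made for this example.

```python
import numpy as np

def refine_static_positions(prev_xyz, sta_feat, w_delta):
    # Hypothetical refinement: regress a 3D offset from each static target feature
    # and add it to the position output from the previous iteration.
    return prev_xyz + sta_feat @ w_delta

def project_to_feature_map(xyz, K, R, t, stride):
    cam = xyz @ R.T + t                          # scene -> camera coordinates, (N, 3)
    visible = cam[:, 2] > 1e-6                   # points in front of the camera
    depth = np.where(visible, cam[:, 2], 1.0)    # placeholder depth avoids division by zero
    pix = cam @ K.T                              # homogeneous pixel coordinates
    uv = pix[:, :2] / depth[:, None]             # perspective division -> pixel coordinates
    return uv / stride, visible                  # pixel -> feature-map coordinates

# Toy camera (assumed values): focal length 1000 px, principal point (800, 450).
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)

prev_xyz = np.array([[2.0, 1.0, 20.0]])          # static position from the previous iteration
sta_feat = np.array([[1.0, 0.0, 0.0, 0.0]])      # toy static target feature
w_delta = np.zeros((4, 3))
w_delta[0, 2] = 5.0                              # toy weights: move the point 5 units deeper

xyz = refine_static_positions(prev_xyz, sta_feat, w_delta)      # [[2, 1, 25]]
uv, visible = project_to_feature_map(xyz, K, R, t, stride=16)   # [[55.0, 30.625]]
```

The `visible` flag marks points behind the camera so that a caller can ignore their projected coordinates.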
[0011] In one possible implementation, the aforementioned method may further include: encoding the position information of the current iteration to obtain updated position information of the current iteration. Therefore, in this embodiment, the position information can be re-encoded, which is equivalent to updating the position information, thereby improving the position accuracy of dynamic and static targets.
[0012] In one possible implementation, the aforementioned fusion of dynamic and static target features obtained from the previous iteration based on location information to obtain a fusion result can include: fusing dynamic and static target features obtained from the previous iteration based on location information using an attention mechanism to obtain a fusion result. Therefore, in this embodiment, fusion can be performed based on an attention mechanism, thereby combining contextual semantics for fusion, making the obtained fusion result more accurately describe dynamic and static targets and improving subsequent perception accuracy.
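One way such an attention-based fusion could look is sketched below: dynamic and static target features are concatenated, a position embedding is added, and single-head scaled dot-product attention lets every target attend to every other target. The projection matrices (`w_pos`, `w_q`, `w_k`, `w_v`), the residual connection, and the use of a single head are assumptions for illustration, not details given in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_dynamic_static(dyn_feat, sta_feat, dyn_pos, sta_pos, w_pos, w_q, w_k, w_v):
    feats = np.concatenate([dyn_feat, sta_feat], axis=0)     # (N, C)
    pos = np.concatenate([dyn_pos, sta_pos], axis=0)         # (N, 3)
    x = feats + pos @ w_pos                                  # inject location information
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))           # (N, N) attention weights
    fused = feats + attn @ v                                 # residual update
    n_dyn = dyn_feat.shape[0]
    return fused[:n_dyn], fused[n_dyn:], attn

rng = np.random.default_rng(0)
C = 8
dyn_feat, sta_feat = rng.normal(size=(3, C)), rng.normal(size=(4, C))
dyn_pos, sta_pos = rng.normal(size=(3, 3)), rng.normal(size=(4, 3))
w_pos = rng.normal(size=(3, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) for _ in range(3))

dyn_fused, sta_fused, attn = fuse_dynamic_static(
    dyn_feat, sta_feat, dyn_pos, sta_pos, w_pos, w_q, w_k, w_v)
```

Each row of `attn` sums to 1, so every fused feature is a weighted combination of the value vectors of all dynamic and static targets.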
[0013] In one possible implementation, the aforementioned acquisition of image features includes: acquiring an input image, which may include an image captured by a monocular camera or one or more frames captured by a multi-view camera; and extracting features from the input image using a feature extraction network to obtain image features. In this application, the input image can be a monocular image or a multi-view image; therefore, the method provided in this application can be applied to monocular or multi-view camera scenarios.
[0014] In one possible implementation, the aforementioned method further includes: segmenting based on image features to obtain information about at least one object in the input image. In this application embodiment, segmentation can also be performed based on extracted image features, which is applicable to scenarios requiring segmentation tasks.
[0015] In one possible implementation, information about at least one of the aforementioned objects is used as a constraint when sampling dynamic and static target features from image features. In this application embodiment, during target perception, constraints can be formed based on the segmentation results, thereby enabling more accurate extraction of dynamic and static target features when extracting dynamic and static target features from image features.
[0016] In one possible implementation, the aforementioned acquisition of information about dynamic and static targets based on the features of dynamic and static targets may include: acquiring a bounding box for the dynamic target based on its features, and acquiring the segmentation result and height information of the static target based on its features. In this embodiment, the location of a dynamic target can be marked, while the height information of a static target can be segmented and identified, thereby achieving the perception of both dynamic and static targets.
[0017] In one possible implementation, the input image may include images captured by a camera during vehicle operation, and the information about the dynamic targets and the information about the static targets is applied to the vehicle's autonomous or assisted driving. Therefore, the method provided in this application can be applied to the autonomous or assisted driving systems of vehicles, improving driving safety through more accurate target perception.
[0018] In a second aspect, this application provides a target sensing device, comprising:
[0019] The feature extraction module is used to obtain image features, which include features extracted from the input image.
[0020] The acquisition module is used to iteratively acquire the features of dynamic targets and static targets in the input image based on image features. The objects in the input image include dynamic targets and static targets, and the moving speed of the dynamic targets is greater than that of the static targets.
[0021] The perception module is used to obtain information about dynamic targets and information about static targets based on the dynamic target features and the static target features.
[0022] The process of any iteration executed by the acquisition module includes: acquiring location information, which includes information representing the location of static targets and information representing the location of dynamic targets; fusing the dynamic target features and static target features obtained in the previous iteration based on the location information to obtain a fusion result; and sampling features from image features based on the fusion result and the location information to obtain the dynamic target features and static target features of the current iteration.
[0023] The effects achieved by the second aspect and any optional implementation of the second aspect can be referred to in the description of the first aspect or any optional implementation of the first aspect, and will not be repeated here.
[0024] In one possible implementation, the acquisition module is specifically used to: acquire location information based on the dynamic target features and static target features output from the previous iteration.
[0025] In one possible implementation, the location information includes the position of the dynamic target in 3D space and its position in image features, and the position of the static target in 3D space and its position in image features. The acquisition module is specifically used to: decode the dynamic target features output from the previous iteration to obtain the position of the dynamic target in 3D space and its position in image features; adjust the position of the static target in 3D space output from the previous iteration based on the static target features to obtain the position of the static target in 3D space output from the current iteration; and obtain the position of the static target in image features based on the position of the static target in 3D space output from the current iteration.
[0026] In one possible implementation, the device further includes a position encoding module for encoding the position information of the current iteration to obtain updated position information of the current iteration.
[0027] In one possible implementation, the acquisition module is specifically used to fuse the dynamic target features and static target features obtained in the previous iteration based on the location information using an attention mechanism, so as to obtain a fusion result.
[0028] In one possible implementation, the feature extraction module is specifically used to: acquire an input image, which includes an image captured by a monocular camera or one or more frames of images captured by a multi-view camera; and extract features from the input image through a feature extraction network to obtain image features.
[0029] In one possible implementation, the apparatus further includes a segmentation module for segmenting based on image features to obtain information about at least one object in the input image.
[0030] In one possible implementation, information about at least one object is used as a constraint when sampling dynamic target features and static target features from image features.
[0031] In one possible implementation, the perception module is specifically used to obtain the bounding box of the dynamic target based on the features of the dynamic target, and to obtain the segmentation result and height information of the static target based on the features of the static target.
[0032] In one possible implementation, the input image includes images captured by a camera during vehicle operation, and the information about dynamic targets and the information about static targets is applied to the vehicle's autonomous or assisted driving.
[0033] In a third aspect, embodiments of this application provide a target sensing device, including a processor and a memory, wherein the processor and the memory are interconnected via a circuit, and the processor calls program code in the memory to execute the processing-related functions of the target perception method described in the first aspect or any optional implementation thereof. Optionally, the target sensing device may be a chip.
[0034] In a fourth aspect, embodiments of this application provide a target sensing device, which may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit. The processing unit is used to perform the processing-related functions described in the first aspect or any optional implementation of the first aspect.
[0035] In a fifth aspect, embodiments of this application provide a computer-readable storage medium including instructions that, when executed on a computer, cause the computer to perform the method described in the first aspect or any optional implementation thereof.
[0036] In a sixth aspect, embodiments of this application provide a computer program product containing instructions that, when run on a computer, cause the computer to perform the method described in the first aspect or any optional implementation thereof.
Attached Figure Description
[0037] Figure 1 is a schematic diagram of an artificial intelligence framework used in this application;
[0038] Figure 2 is a schematic diagram of a system architecture provided in this application;
[0039] Figure 3 is a schematic diagram of another system architecture provided in this application;
[0040] Figure 4 is a schematic diagram of an application architecture provided in this application;
[0041] Figure 5 is a flowchart of a target perception method provided in this application;
[0042] Figure 6 is a flowchart of another target perception method provided in this application;
[0043] Figure 7 is a schematic diagram of examples of dynamic queries and static queries provided in this application;
[0044] Figure 8 is a schematic diagram of the steps of a 3D-to-2D deformable attention module provided in this application;
[0045] Figure 9 is a schematic diagram of the steps of the position adjustment module and the dynamic-static fusion attention module provided in this application;
[0046] Figure 10 is a schematic diagram of the output of a target perception method provided in this application;
[0047] Figure 11 is a schematic diagram of the output of another target perception method provided in this application;
[0048] Figure 12 is a schematic diagram of the output of another target perception method provided in this application;
[0049] Figure 13 is a schematic diagram of the structure of a target sensing device provided in this application;
[0050] Figure 14 is a schematic diagram of the structure of another target sensing device provided in this application;
[0051] Figure 15 is a schematic diagram of the structure of a chip provided in this application.
Detailed Implementation
[0052] The technical solutions of the embodiments of this application will be described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0053] First, the overall workflow of the artificial intelligence system is described. Please refer to Figure 1, which illustrates a structural framework for artificial intelligence (AI). The framework is elaborated below along two dimensions: the "Intelligent Information Chain" (horizontal axis) and the "IT Value Chain" (vertical axis). The "Intelligent Information Chain" reflects a series of processes from data acquisition to processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensation process of "data, information, knowledge, wisdom." The "IT Value Chain" reflects the value that AI brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (provision and processing technology implementation) to the industrial ecosystem of the system.
[0054] (1) Infrastructure
[0055] Infrastructure provides computing power to support artificial intelligence systems, enabling communication with the external world and providing support through a basic platform. This communication occurs through sensors; computing power is provided by intelligent chips, such as hardware acceleration chips (CPUs, NPUs, GPUs, ASICs, or FPGAs); the basic platform includes distributed computing frameworks and related platform guarantees and support, which may include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to acquire data, which is then provided to intelligent chips in the distributed computing system provided by the basic platform for computation.
[0056] (2) Data
[0057] The data at the next layer of infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, text, and IoT data from traditional devices, including business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.
[0058] (3) Data processing
[0059] Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.
[0060] Among them, machine learning and deep learning can perform intelligent information modeling, extraction, preprocessing, and training of data by symbolizing and formalizing it.
[0061] Reasoning refers to the process in which, in a computer or intelligent system, the machine thinks and solves problems by simulating human intelligent reasoning, based on reasoning control strategies and using formalized information. Typical functions include search and matching.
[0062] Decision-making refers to the process of making decisions based on intelligent information after reasoning, and it typically provides functions such as classification, sorting, and prediction.
[0063] (4) General capabilities
[0064] After the data processing mentioned above, the results of the data processing can be used to form some general capabilities, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
[0065] (5) Smart Products and Industry Applications
[0066] Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application areas mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, and safe cities.
[0067] This application involves numerous neural networks and related applications of image processing. To better understand the solutions of this application, the relevant terms and concepts of neural networks and images that may be involved in this application will be introduced below.
[0068] (1) Neural Network
[0069] A neural network can be composed of neural units. A neural unit can refer to an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs, and its output can be expressed as formula (1-1):
[0070] $$f\left(\sum_{s=1}^{n} W_s x_s + b\right) \tag{1-1}$$
[0071] Where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer; the activation function can be the sigmoid function. A neural network is a network formed by connecting multiple of the above-mentioned individual neural units together; that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be a region composed of several neural units.
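A single neural unit of this kind, a weighted sum of the inputs plus a bias passed through an activation function, can be sketched in a few lines. The function names are illustrative, and the sigmoid is used because the text names it as one possible activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(x, w, b, f=sigmoid):
    # Weighted sum of the inputs x_s with weights W_s, plus the bias b,
    # followed by the activation function f.
    return f(sum(w_s * x_s for w_s, x_s in zip(w, x)) + b)

# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
out = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)
```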
[0072] (2) Deep Neural Networks
[0073] A deep neural network (DNN), also known as a multilayer neural network, can be understood as a neural network with multiple intermediate layers. Based on the position of these layers, the internal neural network of a DNN can be divided into three categories: input layer, intermediate layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are intermediate layers, or hidden layers. The layers are fully connected, meaning that any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer.
[0074] Although DNNs appear complex, the work of each layer can be represented by the linear relational expression $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, also known as the bias parameter; W is the weight matrix (also called coefficients); and α() is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because DNNs have many layers, the number of coefficients W and offset vectors $\vec{b}$ is also quite large. These parameters are defined in a DNN as follows, taking the coefficient W as an example: assuming a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient W is located, while the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer.
[0075] In summary, the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W^{L}_{jk}$.
[0076] It's important to note that the input layer does not have a W parameter. In deep neural networks, more intermediate layers allow the network to better represent complex real-world situations. Theoretically, the more parameters a model has, the higher its complexity and "capacity," meaning it can perform more complex learning tasks. Training a deep neural network is essentially the process of learning the weight matrix, with the ultimate goal of obtaining the weight matrix of all layers in the trained deep neural network (a weight matrix formed by the vectors W from many layers).
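The layer-by-layer computation and the indexing convention can be illustrated with a small three-layer network. This is a sketch with toy sizes (3, 4, and 2 neurons) and an identity activation chosen so that the arithmetic is easy to follow; neither choice comes from the text. Because NumPy indices start at 0, the coefficient from the 4th neuron of layer 2 to the 2nd neuron of layer 3 is stored at `W3[1, 3]`.

```python
import numpy as np

def forward(x, weights, biases, act):
    # Apply y = act(W x + b) once per non-input layer.
    for W, b in zip(weights, biases):
        x = act(W @ x + b)
    return x

def identity(z):
    return z

# Layer 1 is the input layer (3 neurons) and has no W parameter.
W2, b2 = np.ones((4, 3)), np.zeros(4)      # layer 2: 4 neurons
W3, b3 = np.zeros((2, 4)), np.zeros(2)     # layer 3: 2 neurons
W3[1, 3] = 2.0                             # 4th neuron of layer 2 -> 2nd neuron of layer 3

y = forward(np.ones(3), [W2, W3], [b2, b3], act=identity)
# Layer 2 outputs [3, 3, 3, 3]; only the 2nd output neuron receives a signal: y = [0, 6]
```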
[0077] (3) Convolutional Neural Network
[0078] A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A CNN contains a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter. A convolutional layer is a layer of neurons in a CNN that performs convolutional processing on the input signal. In a convolutional layer of a CNN, a neuron may be connected to only some of the neurons in the neighboring layer. A convolutional layer typically contains several feature planes, each composed of a series of rectangularly arranged neural units. Neural units on the same feature plane share weights, and these shared weights are the convolutional kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The convolutional kernel can be initialized as a matrix of random size, and during the training of the CNN the kernel can learn appropriate weights. Furthermore, a direct benefit of weight sharing is that it reduces the connections between layers in the CNN while also reducing the risk of overfitting.
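Weight sharing can be made concrete with a small example in which the same kernel is applied at every location of the input. The sketch below computes a "valid" 2D cross-correlation, which is what CNN frameworks commonly call convolution; the function name and the toy values are illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide one shared kernel over every position of the image (no padding, stride 1).
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16).reshape(4, 4)        # image[i, j] = 4 * i + j
kernel = np.array([[1, 0],
                   [0, -1]])               # difference along the diagonal
feature_plane = conv2d_valid(image, kernel)
# Every entry equals image[i, j] - image[i + 1, j + 1] = -5
```

The layer has only four parameters regardless of the image size, and because the same kernel is used everywhere, the response to this constant diagonal gradient is identical at every position.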
[0079] (4) Self-attention model
[0080] Self-attention refers to the efficient encoding of a sequence of data (such as the natural language corpus "Your phone is great.") into several multi-dimensional vectors for convenient numerical computation. These multi-dimensional vectors incorporate the similarity information between each element in the sequence; this similarity is called self-attention. A self-attention model can be understood as a mapping from a query to a series of key-value pairs. The dynamic and static target features mentioned below in this application can be understood as the query input to the model.
[0081] (5) Multi-head self-attention mechanism
[0082] Given the same set of queries, keys, and values, the model is expected to learn different behaviors based on the same attention mechanism and then combine these behaviors as knowledge, for example capturing dependencies of various ranges within a sequence (such as short-range and long-range dependencies). To this end, the attention mechanism combines different representation subspaces of the query, key, and value. Compared to a self-attention model, a multi-head attention model increases the number of heads. The query, key, and value first undergo a linear transformation and are then input into scaled dot-product attention. This is done h times, once per head, with no parameter sharing between heads: the parameters W for the linear transformation of Q, K, and V are different each time. The results of the h scaled dot-product attention computations are then concatenated, and a final linear transformation is performed to obtain the result of the multi-head attention.
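The multi-head computation described above can be sketched as follows. The sizes (4 tokens, model width 8, 2 heads of width 4) and the random weights are assumptions for the example; each head has its own projection matrices, and the head outputs are concatenated before a final linear transformation `w_o`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def multi_head_attention(query, key, value, heads, w_o):
    # heads is a list of (w_q, w_k, w_v) tuples, one independent set per head.
    outputs = [
        scaled_dot_product_attention(query @ w_q, key @ w_k, value @ w_v)
        for w_q, w_k, w_v in heads
    ]
    return np.concatenate(outputs, axis=-1) @ w_o

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
w_o = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_attention(x, x, x, heads, w_o)  # self-attention: query = key = value
```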
[0083] (6) Embedding: refers to the feature representation of a sample.
[0084] (7) Loss Function
[0085] In training a deep neural network, the output of the network should be as close as possible to the value that is actually desired. The network's prediction is therefore compared with the target value, and the weight vector of each layer is updated based on the difference between the two (there is usually an initialization step before the first update, in which parameters are pre-configured). For example, if the prediction is too high, the weight vectors are adjusted so that the network predicts a lower value. This adjustment continues until the deep neural network predicts the target value or a value very close to it. It is therefore necessary to predefine how to measure the difference between the predicted and target values. This is the role of the loss function or objective function, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, so training the deep neural network becomes a process of minimizing this loss.
[0086] (8) Backpropagation algorithm
[0087] Neural networks can employ backpropagation (BP) to correct the parameters of the initial neural network model during training, thereby reducing the reconstruction error loss. Specifically, forward propagation of the input signal to the output generates error loss; this error loss information is then propagated back to update the parameters of the initial neural network model, leading to convergence of the error loss. The backpropagation algorithm is an error-loss-driven backpropagation process aimed at obtaining the optimal parameters of the neural network model, such as the weight matrix.
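The interplay of loss function and backpropagation can be illustrated on the smallest possible model, a single linear unit trained with gradient descent. The mean squared error loss, the learning rate, the step count, and the toy data generated from y = 2x + 1 are assumptions chosen for this sketch; a deep network applies the same idea layer by layer through the chain rule.

```python
import numpy as np

def mse_loss(pred, target):
    # Larger values indicate a greater difference between prediction and target.
    return float(np.mean((pred - target) ** 2))

def train_linear(x, y, lr=0.1, steps=200):
    rng = np.random.default_rng(0)
    w = rng.normal(size=x.shape[1])        # parameters pre-configured before the first update
    b = 0.0
    losses = []
    for _ in range(steps):
        pred = x @ w + b                   # forward propagation
        err = pred - y
        losses.append(mse_loss(pred, y))
        grad_w = 2.0 * x.T @ err / len(y)  # gradient of the loss with respect to w
        grad_b = 2.0 * err.mean()          # gradient of the loss with respect to b
        w = w - lr * grad_w                # move the parameters against the gradient
        b = b - lr * grad_b
    return w, b, losses

x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * x[:, 0] + 1.0
w, b, losses = train_linear(x, y)
```

The recorded losses shrink as training proceeds, and the learned parameters approach the weight 2 and bias 1 used to generate the data.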
[0088] (9) Mask
[0089] A mask can be understood as data similar to an image. In the embodiments of this application, by fusing an image and a mask, the focus of certain content in the image can be increased. Typically, a mask can be used to extract regions of interest (ROIs). For example, a pre-made ROI mask can be fused with the image to be processed to obtain an ROI image, where image values within the ROI remain unchanged, while image values outside the ROI are all 0. It can also serve a masking function, using a mask to shield certain areas of an image, preventing them from participating in processing or the calculation of processing parameters, or processing or statistically analyzing only the shielded areas.
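Both uses of a mask mentioned above, extracting a region of interest and restricting a computation to selected pixels, can be shown with a binary array of the same shape as the image. The 4×4 image and the position of the ROI are toy values for illustration.

```python
import numpy as np

def apply_roi_mask(image, mask):
    # Values inside the ROI (mask == 1) are unchanged; values outside become 0.
    return image * mask

image = np.arange(1, 17).reshape(4, 4)
mask = np.zeros((4, 4), dtype=image.dtype)
mask[1:3, 1:3] = 1                                # ROI: the central 2x2 block

roi_image = apply_roi_mask(image, mask)
roi_mean = image[mask.astype(bool)].mean()        # statistic over the ROI pixels only
```

Here `roi_image` keeps the values 6, 7, 10, and 11 in the central block and is zero elsewhere, and `roi_mean` is their average, 8.5.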
[0090] The method provided in this application can be executed on a server or on a terminal device. The terminal device can be a mobile phone with image processing capabilities, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a camcorder, a smartwatch, a wearable device (WD), or an autonomous vehicle, and this application does not limit the type of device.
[0091] The system architecture provided in the embodiments of this application is described below.
[0092] Referring to Figure 2, this application provides a system architecture 200. As shown in the system architecture 200, the data acquisition device 260 can be used to collect training data. After the data acquisition device 260 collects the training data, it stores the training data in the database 230. The training device 220 trains the target model / rule 201 based on the training data maintained in the database 230. The target model / rule 201 is the target perception model provided in this application.
[0093] The following describes how the training device 220 obtains the target model / rule 201 based on the training data. For example, the training device 220 processes multiple frames of sample images to output corresponding predicted labels, calculates the loss between the predicted labels and the original labels of the samples, and updates the classification network based on this loss until the predicted labels are close to the original labels of the samples or the difference between the predicted labels and the original labels is less than a threshold, thereby completing the training of the target model / rule 201.
[0094] The target model / rule 201 in this embodiment can specifically be a neural network, such as the neural network for target perception mentioned in this embodiment. It should be noted that in practical applications, the training data maintained in the database 230 may not all come from the data acquisition device 260; it may also be received from other devices. Furthermore, it should be noted that the training device 220 may not necessarily train the target model / rule 201 entirely based on the training data maintained in the database 230; it may also obtain training data from the cloud or other sources for model training. The above description should not be construed as limiting the embodiments of this application.
[0095] The target model / rule 201 trained by the training device 220 can be applied to different systems or devices, such as the execution device 210 shown in Figure 2. The execution device 210 can be a terminal, such as a mobile phone terminal, tablet computer, laptop computer, augmented reality (AR) / virtual reality (VR) device, in-vehicle terminal, television, etc., or it can be a server or cloud service. In Figure 2, the execution device 210 is equipped with a transceiver 212, which may include an input / output (I / O) interface or other wireless or wired communication interfaces, for data interaction with external devices. Taking the I / O interface as an example, the user can input data to the I / O interface through the client device 240.
[0096] When the execution device 210 preprocesses the input data, or when the calculation module 212 of the execution device 210 performs calculations and other related processing, the execution device 210 can call data, code, etc. in the data storage system 250 for the corresponding processing, or store the data, instructions, etc. obtained from the corresponding processing in the data storage system 250.
[0097] Finally, I / O interface 212 returns the processing result to client device 240, thereby providing it to the user.
[0098] It is worth noting that the training device 220 can generate corresponding target models / rules 201 based on different training data for different objectives or tasks. The corresponding target models / rules 201 can be used to achieve the above objectives or complete the above tasks, thereby providing the user with the required results.
[0099] In the scenario shown in Figure 2, the user can manually provide input data through the interface provided by the transceiver 212. Alternatively, the client device 240 can automatically send input data to the transceiver 212. If user authorization is required for the client device 240 to automatically send input data, the user can set the corresponding permissions in the client device 240. The user can view the output results of the execution device 210 on the client device 240, which can be presented in various forms such as display, sound, or animation. The client device 240 can also act as a data acquisition terminal, collecting the input data and output results of the transceiver 212 as new sample data and storing them in the database 230. Alternatively, the input data and output results of the transceiver 212 shown in the figure can be stored directly in the database 230 as new sample data without going through the client device 240.
[0100] It is worth noting that Figure 2 is merely a schematic diagram of a system architecture provided in an embodiment of this application. The positional relationships between the devices, components, modules, etc., shown in the diagram do not constitute any limitation. For example, in Figure 2, the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 may also be placed within the execution device 210.
[0101] As shown in Figure 2, the target model / rule 201 is obtained through training by the training device 220. The target model / rule 201 can be the target perception model in this application embodiment.
[0102] For example, the system architecture to which the method provided in this application is applied can be as shown in Figure 3. In this system architecture 300, the server cluster 310 is implemented by one or more servers. The server cluster 310 can use data in the data storage system 250, or call program code in the data storage system 250, to implement the steps of the method provided in this application.
[0103] Users can interact with the server cluster 310 using their respective user devices (e.g., terminal 301). Each local device can represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car, or other type of cellular phone, media consumption device, wearable device, set-top box, game console, etc.
[0104] Each user's local device can interact with the server cluster 310 through a communication network using any communication mechanism / standard. The communication network can be a wide area network (WAN), a local area network (LAN), a point-to-point connection, or any combination thereof. Specifically, the communication network can include a wireless network, a wired network, or a combination of both. The wireless network includes, but is not limited to, any one or more combinations of: 5th-Generation (5G) systems, Long Term Evolution (LTE) systems, Global System for Mobile Communication (GSM) or Code Division Multiple Access (CDMA) networks, Wideband Code Division Multiple Access (WCDMA) networks, Wireless Fidelity (WiFi), Bluetooth, Zigbee, Radio Frequency Identification (RFID), Long Range (Lora) wireless communication, and Near Field Communication (NFC). The wired network can include fiber optic communication networks or networks composed of coaxial cables.
[0105] In another implementation, one or more aspects of the execution device 210 may be implemented by each local device. For example, the terminal 301 may provide local data or feedback calculation results to the execution device 210.
[0106] It should be noted that all the functions of the execution device 210 can also be implemented by a local device. For example, terminal 301 implements the functions of the execution device 210 and provides services to its own users, or provides services to users of terminal 301.
[0107] The method provided in this application can be applied to various scenarios requiring target perception. For example, the method provided in this application can be applied to autonomous driving, assisted driving, or robotics. Some application scenarios are described below as examples.
[0108] Scenario 1: Autonomous Driving
[0109] The method provided in this application can be applied to the perception module of a vehicle, such as identifying dynamic or static targets from the collected perception data.
[0110] For example, the process of autonomous driving can be as shown in Figure 4. The perception data section can specifically include the collection of road surface data by various devices such as cameras, lasers, and lidar. When collecting data via laser scanning, the laser typically collects information from the external environment at a frequency of 10 FPS or another frequency. When collecting data via cameras, the cameras typically collect external scene information at a rate of 25 or 30 FPS. Furthermore, vehicles can be equipped with monocular or multi-view cameras. A multi-view camera is a camera that captures images from different angles, or it can be understood as a camera system composed of multiple cameras capturing images from different angles.
[0111] Target detection is then performed. The targets in the perception data can be specifically divided into dynamic targets and static targets. For example, in the images collected by the cameras installed in the vehicle, dynamic targets can include objects with a certain speed, such as pedestrians and vehicles, while static targets can include fixed objects such as traffic signs, lane lines, or drivable areas (freespace).
[0112] Target tracking can smooth detection results, measure velocity, and predict target trajectories. Target tracking is a crucial part of the perception module; both visual and laser perception rely on it.
[0113] Each sensor has its own advantages in perception. Multi-sensor fusion allows each sensor to play its role, achieving the goal of the fusion result being better than the result of any single sensor.
[0114] In the planning and control module, comprehensive obstacle information is fused from multiple sensors to make reasonable path planning and control the vehicle's driving status. This module determines the vehicle's path and can generally be understood as the control center of the autonomous vehicle.
[0115] The method provided in this application can be applied to dynamic target detection and static road segmentation in autonomous driving. By fusing dynamic and static task features, it enhances information complementarity among multiple tasks, thereby improving perception accuracy.
[0116] Scenario 2: Assisted Driving
[0117] Similar to autonomous driving scenarios, the method provided in this application can also be applied to the perception module of a vehicle. The difference lies in that the vehicle can be controlled by a user, and the method provided in this application can be used to monitor dynamic and static targets near the vehicle in real time. When the user controls the vehicle, the driving direction can be adjusted, obstacle avoidance can be performed, or the surrounding environment can be displayed on the vehicle's screen, marking information about dynamic or static targets. This allows the user to more accurately understand the vehicle's surroundings and improves driving safety.
[0118] Scenario 3: Robot
[0119] The method provided in this application can be applied to intelligent robots. These robots can be equipped with lidar or image sensors to collect data within their monitoring range in real time, and to identify and track objects within the collected data. For example, an image sensor can be installed in the intelligent robot to identify objects in the images collected in real time, and then track them. The intelligent robot can then perform tracking operations based on these objects, such as adjusting its orientation or direction of travel.
[0120] Taking autonomous driving as an example, autonomous driving perception typically includes important tasks such as 3D dynamic object detection and BEV static road structure recognition. Generally, a backbone network can be used to extract depth features from the input image and transform these features into the BEV space, allowing for subsequent object detection or segmentation tasks based on these BEV features. However, there is a possibility that the features may not be fully utilized.
[0121] For example, in some scenarios, a shared backbone network can be used for multi-view image feature extraction. Then, a viewpoint transformation module can convert the 2D image features to the BEV space. Finally, the BEV features are sent to multiple parallel task heads for multi-task prediction, achieving the goal of sharing computational resources and outputting multi-task results. However, the mining of each feature is insufficient, which limits the accuracy of the final output.
[0122] For example, in some scenarios, a backbone network can be used to extract depth features from multi-view images, a depth estimation network can be used to predict the depth of 2D images, and the 2D image features can be projected into 3D space using camera intrinsic and extrinsic parameters to generate a 3D pseudo-point cloud. Finally, the 3D pseudo-point cloud is converted into 2D BEV features. A 3D detection head or a BEV segmentation head is then connected after the BEV features to perform 3D detection or BEV segmentation. However, similarly, the mining of each feature is insufficient, which limits the accuracy of the final output. Furthermore, when the number of queries in the Transformer is large, the computational complexity increases quadratically, consuming significant computational resources.
[0123] Therefore, this application provides a target perception method that can be based on a multi-task method of fusing 3D dynamic and static contextual information with sparse queries. It uses a relatively small number of features to express dynamic and static elements in 3D space and predict their 3D spatial positions. It then uses position encoding and attention mechanisms to fuse 3D information, thereby improving the performance of pure vision 3D multi-task perception.
[0124] The method provided in this application is described below.
[0125] Referring to Figure 5, the flow of a target perception method provided in this application is as follows.
[0126] 501. Obtain image features.
[0127] The image features can include features extracted from the input image. The input image can include one or more frames captured by a monocular camera, or one or more frames captured by a multi-view camera. For example, a multi-view camera can be installed in a vehicle to capture images of the vehicle's environment while the vehicle is in motion, thereby obtaining images from multiple perspectives.
[0128] This application exemplifies the use of an image captured by a multi-view camera as the input image. The input image mentioned below can be an image captured by a monocular camera or an image captured by a multi-view camera, and will not be described in detail below.
[0129] Specifically, features can be extracted from the input image using a feature extraction network. This feature extraction network can include networks with structures such as DNN or CNN, as mentioned above, or it can include constructed networks. For example, a backbone network can be used as the feature extraction network to extract features from the input image and obtain image features.
[0130] 502. Iteratively obtain the features of dynamic targets and static targets in the input image based on image features.
[0131] Generally, targets in an image can be divided into static targets and dynamic targets. Dynamic targets can include objects in the shooting scene with a non-zero speed, i.e., moving objects. Static targets are objects that are stationary in the scene or objects with a speed less than a certain value. The speed of dynamic targets is greater than the speed of static targets.
[0132] After obtaining the image features, the features of dynamic targets and static targets in the input image can be obtained iteratively based on the image features.
[0133] In any iteration, location information can be obtained first. This location information can include the position of the static target in the image features and the position of the dynamic target in the image features. Based on the location information, the dynamic target features and the static target features obtained in the previous iteration are fused to obtain the fusion result. Based on the fusion result and the location information, feature sampling is performed from the image features to obtain the dynamic target features and static target features of the current iteration.
[0134] Once the convergence condition is met, the final dynamic target features and static target features can be output. The convergence condition may specifically include reaching a preset number of convergence iterations, the difference between the output features of adjacent iterations being less than a preset difference, or the iteration duration reaching a preset duration, etc., and can be determined according to the actual application scenario.
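The three convergence conditions listed above (an iteration budget, a small change between adjacent iterations, and a time budget) can be sketched as a generic loop. This is a minimal sketch, not the method's actual update: `toy_step` is a hypothetical stand-in for one round of position decoding, fusion, and feature sampling, and all names and default values are assumptions.

```python
import time


def iterate_until_converged(step_fn, features, max_iters=6, min_delta=1e-3, max_seconds=1.0):
    """Apply step_fn repeatedly and report which stopping condition fired."""
    start = time.monotonic()
    for i in range(1, max_iters + 1):
        new_features = step_fn(features)
        # Largest change between the outputs of adjacent iterations.
        delta = max(abs(a - b) for a, b in zip(new_features, features))
        features = new_features
        if delta < min_delta:
            return features, i, "delta"
        if time.monotonic() - start >= max_seconds:
            return features, i, "time"
    return features, max_iters, "max_iters"


def toy_step(features):
    """Hypothetical update: move every feature halfway toward a fixed target."""
    target = [1.0, -2.0, 0.5]
    return [f + 0.5 * (t - f) for f, t in zip(features, target)]


final, iters, reason = iterate_until_converged(toy_step, [0.0, 0.0, 0.0], max_iters=50)
capped, capped_iters, capped_reason = iterate_until_converged(toy_step, [0.0, 0.0, 0.0], max_iters=3)
```

Which condition is used, and with what values, depends on the application scenario, as the paragraph above notes.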
[0135] Specifically, when acquiring location information, dynamic and static target features can be decoded to obtain the positions of the dynamic and static targets in 3D space. This 3D space can be understood as representing the space corresponding to the dynamic and static targets in the actual application scenario, or a space scaled according to a certain ratio. The dynamic and static target features are then used to adjust the 3D spatial positions of the dynamic and static targets output from the previous iteration, resulting in the current 3D positions of the dynamic and static targets. These positions are then projected into the image feature space for image feature sampling.
[0136] Typically, if the current iteration is the first iteration, the initial dynamic and static target features can be spatially decoded to obtain their prior positions in 3D space. These positions are then projected onto the image feature space for feature sampling, yielding image features used for result prediction. If the current iteration is not the first iteration, the positions of the dynamic and static targets output from the previous iteration can be refined to obtain more accurate spatial positions, and more accurate image features can be extracted based on these more accurate spatial positions.
[0137] In addition, to improve the positional accuracy of dynamic and static targets, the positional information can be encoded to obtain updated positional information. In this embodiment, the positions of dynamic and static targets can therefore be encoded in each iteration, which improves their positional accuracy.
[0138] Furthermore, optionally, the fusion of dynamic and static target features can be based on an attention mechanism. Specifically, the dynamic and static target features obtained from the previous iteration can be fused using the attention mechanism based on positional information to obtain a fusion result. Therefore, in this embodiment, the fusion of dynamic and static target features can be based on an attention mechanism, thereby enabling fusion based on the contextual semantics of dynamic and static targets. The resulting fusion result can more accurately represent the features of the target in the input image.
[0139] In one possible implementation, segmentation tasks can also be performed based on image features, such as panoptic segmentation or instance segmentation, to obtain information about at least one object in the input image. Therefore, in this embodiment, segmentation tasks can also be performed based on image features, thus adapting to scenarios requiring segmentation tasks.
[0140] Optionally, if the segmentation task is based on image features, during the sampling process from the image features in any iteration, information of at least one object obtained from the segmentation can be used as a constraint to sample dynamic target features and static target features from the image features, so that the collected features can more accurately represent dynamic targets and static targets and reduce the collected noise.
[0141] 503. Output information about dynamic targets and static targets.
[0142] After the dynamic and static target features are collected, identification can be performed based on these features, thereby outputting information about the dynamic targets and the static targets.
[0143] Specifically, information about dynamic targets may include, but is not limited to, the bounding box corresponding to the dynamic target in the input image, the speed or direction of movement of the dynamic target, etc. Information about static targets may include, for example, the height, segmentation result, or shape of the static target in the input image.
[0144] Therefore, in the embodiments of this application, during the iterative process of target perception, dynamic target features and static target features can be fused to achieve contextual information fusion between dynamic and static targets, obtain information that can more accurately represent the relative positions of dynamic and static targets, and sample from image features based on the fusion result, thereby more accurately collecting dynamic and static target features from image features, thus improving iterative convergence efficiency, accuracy of target perception, and perception efficiency.
[0145] The method flow provided in this application has been described above. The following section will describe the method flow provided in this application in more detail, taking a specific application scenario, such as an image captured by a multi-view camera installed in a vehicle, as an example.
[0146] Referring to Figure 6, this application provides a flowchart of another target perception method.
[0147] The target perception method provided in this application can be divided into multiple parts, which can be executed by multiple modules in the target perception model. For example, as shown in Figure 6, the method can be divided into feature extraction, panoramic segmentation, and target perception, which are introduced in detail below.
[0148] I. Feature Extraction
[0149] Taking images captured by a vehicle-mounted multi-view camera as an example, as shown in Figure 6, a multi-view camera is installed on the vehicle. This can be achieved by placing image sensors at different locations on the vehicle or by placing image sensors with different perspectives at the same location on the vehicle, thereby capturing one or more frames of images of the vehicle's environment. The captured one or more frames are used as input to the backbone network, and the output is the features extracted from the input images, i.e., image features.
[0150] The backbone network can be used to extract features from the input image at one or more scales. This application exemplarily describes the extraction of features at multiple scales. The multi-scale features mentioned below can also be replaced with features at one scale, which will not be elaborated further below.
[0151] For example, suppose that at a certain time RGB images are acquired from K cameras, i.e., from the different perspectives of multiple cameras, denoted as $I = \{im_1, \ldots, im_K\}$ with each $im_k \in \mathbb{R}^{3 \times H \times W}$. The K images are input into a backbone network with a pyramid structure to extract pyramid image features, where $s \in \{8, 16, 32, 64\}$ represents the downsampling factor of the image when extracting features.
[0152] This can be understood as follows: the features extracted from the input image by the feature extraction network are image features of different scales. For subsequent target detection or other tasks, the feature sequence can be extracted from the image features by the subsequent target perception module to obtain the features in 3D space corresponding to dynamic or static targets.
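As a small worked example of the downsampling factors s ∈ {8, 16, 32, 64}, the sketch below computes the spatial size of each pyramid level for one input image. The input size of 512 × 1408 is a hypothetical value chosen for illustration, and rounding up when the size is not divisible by s is an assumption; a real backbone may round differently.

```python
import math


def pyramid_shapes(height, width, strides=(8, 16, 32, 64)):
    """Return the (height, width) of the feature map at each downsampling factor."""
    return {s: (math.ceil(height / s), math.ceil(width / s)) for s in strides}


shapes = pyramid_shapes(512, 1408)
for stride, (h, w) in shapes.items():
    print(f"s={stride}: {h} x {w}")
```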
[0153] II. Panoramic Segmentation
[0154] The panoramic (panoptic) segmentation here can also be replaced by instance segmentation, background segmentation, etc. This application uses panoramic segmentation as an example for illustrative description.
[0155] Segmentation can be performed using a panoramic segmentation network. Image features are used as input to the network, and panoramic segmentation M = σ(P*F) is performed on images from K viewpoints. Here, F represents the image features, P represents the segmentation kernel, M represents the segmentation mask, and σ represents the softmax activation function.
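One way to read M = σ(P*F) is sketched below. The sketch assumes that P*F is a per-pixel dot product between each segmentation kernel and the feature vector at that pixel (a 1×1 convolution) and that the softmax is taken across kernels; this interpretation, the toy feature values, and the function names are assumptions rather than details confirmed by this application.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def segment(features, kernels):
    """features: H x W x C nested lists; kernels: N x C. Returns H x W x N mask probabilities."""
    masks = []
    for row in features:
        mask_row = []
        for pixel in row:
            # P * F: one logit per kernel at this pixel.
            logits = [sum(k * f for k, f in zip(kernel, pixel)) for kernel in kernels]
            mask_row.append(softmax(logits))
        masks.append(mask_row)
    return masks


features = [[[1.0, 0.0], [0.0, 1.0]],
            [[2.0, 0.0], [0.5, 0.5]]]
kernels = [[1.0, 0.0],   # hypothetical kernel for segment 0
           [0.0, 1.0]]   # hypothetical kernel for segment 1
masks = segment(features, kernels)
# Most likely segment per pixel.
labels = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in masks]
```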
[0156] Specifically, the results of panoramic segmentation can be used to assist in target perception or applied to other functions of the vehicle, which is not limited in this application.
[0157] III. Target Perception
[0158] Target perception can be achieved through multiple iterations, with each iteration consisting of several parts. The initial and iterative phases will be described below as examples.
[0159] 1. Initial stage
[0160] Each iteration can be divided into steps executed by multiple modules. For example, the initial stage can be divided into an initial position decoding module, a dynamic and static attention fusion module, and a 3D to 2D deformable attention module, which will be introduced below.
[0161] (1) Initial position decoding module
[0162] For dynamic object detection and static road structure segmentation tasks, a set of learnable queries, namely dynamic queries and static queries, and their corresponding position representations, are initialized respectively. The 3D spatial position initialization decoding of dynamic queries and static queries can be performed by the decoder respectively.
[0163] The position of dynamic queries can be represented by decoding their (x, y, z) 3D spatial position using a single fully connected (FC) layer. For static queries, the (x, y) position is fixed and initialized on a grid, and the z coordinate can be decoded using a single FC layer, representing an estimate of the static road surface height. For example, instances of dynamic and static queries can be as shown in Figure 7.
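The initial position decoding described above can be sketched as follows: a single FC layer maps each dynamic query to (x, y, z), while static queries take fixed (x, y) grid positions and only decode a height z. The weights, query dimension, and grid ranges below are hypothetical toy values, and the helper names are not taken from this application.

```python
def fc(vector, weights, bias):
    """Single fully connected layer: out[j] = sum_i vector[i] * weights[i][j] + bias[j]."""
    return [sum(v * w_row[j] for v, w_row in zip(vector, weights)) + bias[j]
            for j in range(len(bias))]


def grid_xy(x_range, y_range, nx, ny):
    """Centres of an nx-by-ny grid covering the given (min, max) ranges."""
    (x0, x1), (y0, y1) = x_range, y_range
    dx, dy = (x1 - x0) / nx, (y1 - y0) / ny
    return [(x0 + (i + 0.5) * dx, y0 + (j + 0.5) * dy)
            for j in range(ny) for i in range(nx)]


def decode_dynamic(queries, w, b):
    """Decode a full (x, y, z) position for every dynamic query."""
    return [tuple(fc(q, w, b)) for q in queries]


def decode_static(queries, xy, w_z, b_z):
    """Keep the fixed grid (x, y) and decode only the height z for every static query."""
    return [(x, y, fc(q, w_z, b_z)[0]) for q, (x, y) in zip(queries, xy)]


dynamic_pos = decode_dynamic([[3.0, -1.0]],
                             w=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                             b=[0.0, 0.0, 1.5])
cells = grid_xy((-2.0, 2.0), (-1.0, 1.0), nx=2, ny=1)
static_pos = decode_static([[1.0, 0.0], [2.0, 0.0]], cells,
                           w_z=[[0.1], [0.0]], b_z=[0.0])
```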
[0164] (2) Dynamic and static attention fusion module
[0165] Specifically, dynamic and static queries can be fused based on an attention mechanism. Given the dynamic and static queries $Q_l$ and their corresponding position representations $PE_l$ in 3D space, an attention module fuses the dynamic and static elements in 3D space; the fused queries are obtained as $\mathrm{MHA}(Q_l, PE_l)$, where MHA denotes multi-head attention.
[0166] In this embodiment, dynamic and static targets are represented by separate queries, and their features are transferred from 2D images to 3D space for representation. The 3D spatial positions of the dynamic and static queries are encoded, and an attention mechanism is used to fuse the 3D spatial context information of the dynamic and static elements, thereby enhancing the dynamic and static features and improving perception performance. The fusion of dynamic and static elements based on the attention mechanism can improve the feature representation of task queries, thus improving the performance of the corresponding task.
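The attention-based fusion of dynamic and static queries can be illustrated with a deliberately simplified sketch. It uses a single attention head with no learned projections, adds the position representation to each query to form the attention keys, and lets all dynamic and static queries attend to one another. These simplifications, including the additive use of the position representation, are assumptions made for illustration; the application's multi-head attention module may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(queries, pos_enc):
    """Single-head self-attention over the joint set of dynamic and static queries."""
    # Position-aware keys: query feature plus its position representation.
    keys = [[q + p for q, p in zip(qv, pv)] for qv, pv in zip(queries, pos_enc)]
    scale = 1.0 / math.sqrt(len(keys[0]))
    fused = []
    for k_i in keys:
        scores = [scale * sum(a * b for a, b in zip(k_i, k_j)) for k_j in keys]
        weights = softmax(scores)
        # Each fused query is a weighted mix of all dynamic and static queries.
        fused.append([sum(w * v[d] for w, v in zip(weights, queries))
                      for d in range(len(queries[0]))])
    return fused


dynamic_queries = [[1.0, 0.0], [0.0, 1.0]]
static_queries = [[0.5, 0.5], [0.2, 0.8]]
positions = [[0.1, 0.0], [0.0, 0.1], [0.3, 0.3], [0.2, 0.1]]
fused = fuse(dynamic_queries + static_queries, positions)
```

Because every query attends to every other query, the cost grows quadratically with the number of queries, which is one reason a relatively small number of queries is attractive.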
[0167] (3) 3D to 2D deformable attention module
[0168] After determining the locations corresponding to dynamic and static queries, we can sample from image features based on the locations of dynamic and static targets using deformable sampling methods. This allows us to collect features from both dynamic and static targets, thus achieving an efficient and robust deformable sampling approach and improving the robustness of 3D sparse sampling. Subsequently, we can fuse the sampled features with the dynamic and static queries using an attention mechanism. We can also fuse the location representations corresponding to the dynamic and static queries to obtain updated dynamic and static queries.
[0169] For example, as shown in Figure 8, the 3D positions of the dynamic and static targets are denoted $R$ and are projected onto the pixel coordinate system through the camera matrix $T_k$: $(u, v)_k = T_k R$, where $k$ is the camera index. Then, a set of deformable offsets is learned to sample features around the projected query positions, as shown below:
[0170]
[0171]
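The projection and offset-based sampling can be sketched as follows. The sketch assumes that the camera matrix is a 3×4 pinhole projection matrix applied to homogeneous 3D points, followed by division by depth, and that the sampled values are averaged with equal weights; the learned offsets and any learned weighting in this application are replaced here by fixed hypothetical values, and a single-channel feature map is used for brevity.

```python
import math


def project(point_3d, cam):
    """Project (x, y, z) with a 3x4 camera matrix; return (u, v), or None if behind the camera."""
    x, y, z = point_3d
    homogeneous = (x, y, z, 1.0)
    u_h, v_h, depth = (sum(c * p for c, p in zip(row, homogeneous)) for row in cam)
    if depth <= 0:
        return None
    return u_h / depth, v_h / depth


def bilinear(feat, u, v):
    """Bilinearly sample feat[row][col] at column u and row v; outside pixels count as 0."""
    h, w = len(feat), len(feat[0])
    u0, v0 = math.floor(u), math.floor(v)
    total = 0.0
    for r, wr in ((v0, 1 - (v - v0)), (v0 + 1, v - v0)):
        for c, wc in ((u0, 1 - (u - u0)), (u0 + 1, u - u0)):
            if 0 <= r < h and 0 <= c < w:
                total += wr * wc * feat[r][c]
    return total


def sample_query(point_3d, cam, feat, offsets):
    """Average the features sampled at the projected point plus each (du, dv) offset."""
    uv = project(point_3d, cam)
    if uv is None:
        return 0.0
    u, v = uv
    values = [bilinear(feat, u + du, v + dv) for du, dv in offsets]
    return sum(values) / len(values)


cam = [[1.0, 0.0, 1.0, 0.0],   # hypothetical camera: focal length 1, principal point (1, 1)
       [0.0, 1.0, 1.0, 0.0],
       [0.0, 0.0, 1.0, 0.0]]
feat = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]
uv = project((0.5, 0.0, 1.0), cam)
center = sample_query((0.5, 0.0, 1.0), cam, feat, [(0.0, 0.0)])
spread = sample_query((0.5, 0.0, 1.0), cam, feat, [(-0.5, 0.0), (0.5, 0.0)])
```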
[0172] Furthermore, information about dynamic and static targets can be identified based on the output dynamic and static queries. For example, the bounding boxes corresponding to dynamic targets or the geometry or height of static targets within the scene can be identified. 3D target detection can be performed based on dynamic and static queries, reconstructing BEV feature maps from static queries, and then performing BEV road segmentation and height prediction on the BEV features, outputting road segmentation masks and road geometry. 3D boxes corresponding to dynamic targets can also be identified using dynamic queries.
[0173] Furthermore, when sampling from image features, the results of panoramic segmentation can be combined to extract features corresponding to both dynamic and static targets. For example, the regions corresponding to each target obtained from panoramic segmentation can be used as constraints to extract features of dynamic targets from the regions corresponding to dynamic targets, and features of static targets from the regions corresponding to static targets, thereby making the extracted features more accurate. This can enhance the learning of fine-grained image features, improve the robustness of sparse sampling across the view, and enhance 3D perception performance.
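Using segmentation regions as a sampling constraint can be sketched as a filter over candidate sample points. The sketch assumes a per-pixel label map and a hard constraint that discards points falling on a different label; how the constraint is actually applied in this application is not specified, so the function name, the label strings, and the hard filtering are all assumptions.

```python
def constrained_samples(sample_points, seg_labels, wanted_label):
    """Keep (u, v) points that lie inside the image on a pixel carrying wanted_label."""
    height, width = len(seg_labels), len(seg_labels[0])
    kept = []
    for u, v in sample_points:
        col, row = int(round(u)), int(round(v))  # nearest pixel to the sample point
        if 0 <= row < height and 0 <= col < width and seg_labels[row][col] == wanted_label:
            kept.append((u, v))
    return kept


seg_labels = [["road", "road", "car"],
              ["road", "car", "car"]]
points = [(0.2, 0.1), (2.1, 0.0), (1.0, 1.0), (5.0, 0.0)]
car_points = constrained_samples(points, seg_labels, "car")
road_points = constrained_samples(points, seg_labels, "road")
```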
[0174] 2. Iteration Phase
[0175] The iterative phase is similar to the initial phase and may include a 3D position adjustment module, a position encoding module, a dynamic and static attention fusion module, and a 3D to 2D deformable attention module, which are introduced below.
[0176] (1) 3D position adjustment module
[0177] The dynamic and static queries output from the previous layer can be used as input to the next layer of Transformer.
[0178] In the previous layer, the 3D positions corresponding to dynamic and static targets can be output. In subsequent iterations, the dynamic queries and static queries output in the previous iteration can be obtained and decoded to obtain the 3D positions corresponding to dynamic and static targets respectively.
[0179] To improve the positioning accuracy of both static and dynamic targets, their positions can be adjusted. For example, dynamic queries can be used as input to a fully connected (FC) layer to adjust the 3D position of the dynamic target.
[0180] (2) Location encoding module
[0181] Specifically, the adjusted 3D position can be encoded using the trained encoding module, making the new positional embedding (PE) more accurate in describing the space and avoiding inconsistencies between the old positional representation and the adjusted 3D position.
[0182] $R_l = R_{l-1} + \Delta R_l$
[0183]
[0184] The position representation is as follows:
[0185] where the mapping applied in this expression denotes an FC layer.
[0186] In this application, the accuracy of position encoding of dynamic and static elements in 3D space is improved by using dynamic position encoding during the iteration process, and the consistency between the 3D position and 3D encoding of the adjusted queries is improved, thereby improving the prediction accuracy of the 3D position of dynamic and static elements and the accuracy of attention-based fusion.
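The position adjustment and dynamic position encoding can be sketched together. The sketch assumes that an FC layer maps each query to a position offset, that the offset is added to the position from the previous layer, and that a second FC layer maps the adjusted position to a new position representation. The exact formulas are not recoverable from the text above, so the layer shapes, weights, and function names here are hypothetical.

```python
def fc(vector, weights, bias):
    """Single fully connected layer: out[j] = sum_i vector[i] * weights[i][j] + bias[j]."""
    return [sum(v * w_row[j] for v, w_row in zip(vector, weights)) + bias[j]
            for j in range(len(bias))]


def refine_positions(prev_positions, queries, w_delta, b_delta):
    """Add a query-dependent offset to each position from the previous layer."""
    refined = []
    for r_prev, q in zip(prev_positions, queries):
        delta = fc(q, w_delta, b_delta)          # offset predicted from the query
        refined.append([r + d for r, d in zip(r_prev, delta)])
    return refined


def encode_positions(positions, w_pe, b_pe):
    """Regenerate the position representation from the adjusted 3D positions."""
    return [fc(r, w_pe, b_pe) for r in positions]


prev = [[1.0, 2.0, 0.0]]
queries = [[0.5, -0.5]]
refined = refine_positions(prev, queries,
                           w_delta=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                           b_delta=[0.0, 0.0, 0.25])
pos_enc = encode_positions(refined,
                           w_pe=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                           b_pe=[0.0, 0.0])
```

Regenerating the encoding from the adjusted position in every layer is what keeps the position representation consistent with the position it describes.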
[0187] (3) Dynamic and static attention fusion module
[0188] Similar to the dynamic and static fusion process in the initial stage, dynamic and static 3D contextual information is fused based on the encoded position representation.
[0189] For example, the steps performed by the 3D position adjustment module and the dynamic and static attention fusion module can be as shown in Figure 9. Given the dynamic queries and static queries $Q_l$ and their corresponding position representations $PE_l$ in 3D space, the dynamic and static elements are fused in 3D space using an attention mechanism; the fused queries are obtained as $\mathrm{MHA}(Q_l, PE_l)$, where MHA denotes multi-head attention.
[0190] (4) 3D to 2D deformable attention module
[0191] The steps performed by the 3D to 2D deformable attention module in the iteration phase are similar to those performed by the 3D to 2D deformable attention module in the initial phase, and will not be repeated here.
[0192] Multiple iterations can be performed during the iteration phase. The specific number of iterations can be determined according to the actual application scenario, and this application does not limit it.
[0193] Therefore, in this embodiment, during the iterative process of target perception, dynamic and static target features are fused based on an attention mechanism. This allows for the fusion of contextual semantics between dynamic and static target features, thereby strengthening the features of both dynamic and static targets, improving their feature representation capabilities, and ultimately enhancing perception capabilities. In each iteration, the 3D positions of dynamic and static targets can be encoded, improving the accuracy of position encoding in 3D space and enhancing the consistency between the adjusted feature sequence's 3D position and 3D encoding. This improves the prediction accuracy of 3D positions for dynamic and static elements and the accuracy of attention-based fusion. Furthermore, a network framework based on image panoramic segmentation can be used to enhance multi-view sparse 3D detection and road structure cognition, strengthening the learning of fine-grained image features, improving the robustness of surround-view sparse sampling, and enhancing 3D perception performance.
[0194] To further illustrate the effects of the method provided in this application, the perception effects achieved by the method are described below in conjunction with specific application scenarios and commonly used target perception methods.
[0195] The method provided in this application can be validated on existing datasets using DETR3D as the baseline. Table 1 and Table 2 show the validation results on different datasets, which are significantly improved compared to the baseline. Table 3 shows the road structure segmentation and road height estimation results validated on the dataset. In Table 3, this application (DI3D) scores 78.41 for driving area and 28.25 for lane boundary, higher than every other listed method, and it is the only listed method that reports a geometric estimation result (0.071).
[0196]
[0197] Table 1
[0198]
[0199] Table 2
[0200]
Perception method | Driving area | Lane boundary | Geometric estimation
PON [21] | 60.40 | - | -
CNN [4] | 68.96 | 16.51 | -
OFT [41], [16] | 71.69 | 18.07 | -
Lift-Splat [4] | 72.94 | 19.96 | -
DI3D (this application) | 78.41 | 28.25 | 0.071
[0201] Table 3. Different modules also improve the ability to detect objects in images, as shown in Table 4.
[0202]
[0203] Table 4
[0204] Furthermore, taking a specific lane as an example, as shown in Figure 10, panoramic segmentation makes the learned image features clearer and more detailed, which improves 3D perception performance and enhances the robustness of sparse sampling.
[0205] As shown in Figure 11, the top row shows a common learnable positional encoding. After training, this encoding remains fixed during inference, which results in somewhat scattered attention positions. The bottom row shows the effect of dynamic positional encoding. When a multi-layer Transformer predicts the 3D position of each query, it adjusts the 3D coordinates layer by layer, and dynamic positional encoding regenerates a consistent positional encoding from the adjusted 3D coordinates. The attention visualization shows that dynamic positional encoding makes each query focus more on local region features and road structure areas, which is a more reasonable attention pattern.
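The difference between a fixed encoding and a dynamic positional encoding can be sketched as follows. This is an illustrative sketch only: the source does not specify the encoding function, so a sinusoidal encoding is assumed here, and the names `encode_position_3d` and `refine_with_dynamic_encoding`, as well as the parameter `num_freqs`, are hypothetical.

```python
import math


def encode_position_3d(xyz, num_freqs=4):
    """Sinusoidal encoding of one 3D point (assumed form, not from the source).

    Returns a list of length 3 * 2 * num_freqs.
    """
    encoding = []
    for coord in xyz:
        for k in range(num_freqs):
            freq = 2.0 ** k
            encoding.append(math.sin(freq * coord))
            encoding.append(math.cos(freq * coord))
    return encoding


def refine_with_dynamic_encoding(positions, offsets_per_layer):
    """Adjust 3D positions layer by layer and re-encode after each adjustment.

    positions: list of [x, y, z] points, one per query.
    offsets_per_layer: one list of [dx, dy, dz] offsets per decoder layer.
    """
    encodings = [encode_position_3d(p) for p in positions]
    for offsets in offsets_per_layer:
        # Apply this layer's predicted offsets to the current positions.
        positions = [[c + d for c, d in zip(p, o)] for p, o in zip(positions, offsets)]
        # Regenerate the encoding so that it stays consistent with the new positions.
        encodings = [encode_position_3d(p) for p in positions]
    return positions, encodings
```

With a fixed learnable encoding, the second assignment to `encodings` would be absent, so the encoding would no longer correspond to the adjusted coordinates after the first layer.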
[0206] Figure 12 shows the effect of the learnable height in the 3D coordinates of static features. The ground truth, the predicted static road height, and the height projection points are marked. Static height prediction yields a fairly accurate 3D spatial location, which benefits the accuracy of contextual information fusion between dynamic and static elements in 3D space. The features sampled at the corresponding image locations are also more accurate and reasonable.
[0207] The method flow provided in this application has been described above. The apparatus for performing the method provided in this application is described below.
[0208] Referring to Figure 13, which is a schematic structural diagram of a target sensing device provided in this application, the device includes the following modules.
[0209] Feature extraction module 1301 is used to acquire image features, including features extracted from the input image;
[0210] The acquisition module 1302 is used to iteratively acquire the features of dynamic targets and static targets in the input image based on image features. The objects in the input image include dynamic targets and static targets, and the moving speed of the dynamic targets is greater than the moving speed of the static targets.
[0211] The perception module 1303 is used to obtain information about dynamic targets and static targets based on the characteristics of dynamic targets and static targets;
[0212] The process of any iteration executed by the acquisition module 1302 includes: acquiring location information, which includes information representing the location of a static target and information representing the location of a dynamic target; fusing the dynamic target features and static target features obtained in the previous iteration based on the location information to obtain a fusion result; and sampling features from image features based on the fusion result and the location information to obtain the dynamic target features and static target features of the current iteration.
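The control flow of one iteration (acquire location information, fuse, then sample) can be sketched as a simple loop. The sketch is schematic: `get_locations`, `fuse`, and `sample` are hypothetical callables standing in for the position acquisition, attention-based fusion, and feature sampling steps, and their internals are not specified here.

```python
def run_iterations(image_features, dynamic_feats, static_feats,
                   get_locations, fuse, sample, num_iterations):
    """Iteratively refine dynamic and static target features.

    Each callable is a placeholder (an assumption of this sketch):
      get_locations(dynamic_feats, static_feats) -> location information
      fuse(dynamic_feats, static_feats, locations) -> fusion result
      sample(image_features, fused, locations) -> (dynamic_feats, static_feats)
    """
    for _ in range(num_iterations):
        # Location information comes from the previous iteration's features.
        locations = get_locations(dynamic_feats, static_feats)
        # Fuse the previous iteration's dynamic and static features.
        fused = fuse(dynamic_feats, static_feats, locations)
        # Sample image features to obtain the current iteration's features.
        dynamic_feats, static_feats = sample(image_features, fused, locations)
    return dynamic_feats, static_feats
```

The features returned by the final iteration are what the perception module would consume to produce bounding boxes, segmentation results, and height information.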
[0213] In one possible implementation, the acquisition module 1302 is specifically used to: acquire location information based on the dynamic target features and static target features output from the previous iteration.
[0214] In one possible implementation, the location information includes the position of the dynamic target in 3D space and its position in image features, and the position of the static target in 3D space and its position in image features. The acquisition module 1302 is specifically used to: decode the dynamic target features output from the previous iteration to obtain the position of the dynamic target in 3D space and its position in image features; adjust the position of the static target in 3D space output from the previous iteration based on the static target features to obtain the position of the static target in 3D space output from the current iteration; and obtain the position of the static target in image features based on the position of the static target in 3D space output from the current iteration.
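Obtaining a position in the image features from a position in 3D space can be illustrated with a pinhole projection. This is a sketch under stated assumptions: the source does not specify the camera model, so a pinhole model with intrinsics `(fx, fy, cx, cy)`, a point expressed in camera coordinates, and a feature map downsampled by a factor `stride` are all assumed, and the function name is hypothetical.

```python
def project_to_feature_map(point_xyz, intrinsics, stride):
    """Project a camera-frame 3D point onto a downsampled feature map.

    point_xyz: (x, y, z) in camera coordinates, with z as the depth axis.
    intrinsics: (fx, fy, cx, cy) in pixels (assumed pinhole model).
    stride: downsampling factor between the image and the feature map.
    Returns (u, v) in feature-map coordinates, or None if the point lies
    behind the camera and therefore has no valid projection.
    """
    x, y, z = point_xyz
    if z <= 0:
        return None
    fx, fy, cx, cy = intrinsics
    u = fx * x / z + cx   # pixel column
    v = fy * y / z + cy   # pixel row
    return (u / stride, v / stride)
```

For example, a point at `(1.0, 0.5, 10.0)` with intrinsics `(1000, 1000, 640, 360)` and a stride of 8 maps to feature-map position `(92.5, 51.25)`. Adjusting the 3D position of a static target, including its height, changes this projected position and therefore changes where image features are sampled.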
[0215] In one possible implementation, the device further includes a position encoding module 1304, used to encode the position information of the current iteration to obtain the updated position information of the current iteration.
[0216] In one possible implementation, the acquisition module 1302 is specifically used to fuse the dynamic target features and static target features obtained in the previous iteration based on the location information using an attention mechanism, so as to obtain a fusion result.
[0217] In one possible implementation, the feature extraction module 1301 is specifically used to: acquire an input image, which includes an image captured by a monocular camera or one or more frames of images captured by a multi-view camera; and extract features from the input image through a feature extraction network to obtain image features.
[0218] In one possible implementation, the apparatus further includes a segmentation module 1305, configured to segment based on image features to obtain information about at least one object in the input image.
[0219] In one possible implementation, information about at least one object is used as a constraint when sampling dynamic target features and static target features from image features.
[0220] In one possible implementation, the perception module 1303 is specifically used to obtain the bounding box of the dynamic target based on the features of the dynamic target, and to obtain the segmentation result and height information of the static target based on the features of the static target.
[0221] In one possible implementation, the input image includes images captured by a camera during vehicle operation, and the information about dynamic targets and the information about static targets are applied to the vehicle's autonomous driving or assisted driving.
[0222] Refer to Figure 14, which is a schematic structural diagram of another target sensing device provided in this application.
[0223] The target sensing device may include a processor 1401 and a memory 1402. The processor 1401 and the memory 1402 are interconnected via a circuit. The memory 1402 stores program instructions and data.
[0224] The memory 1402 stores the program instructions and data corresponding to the steps performed by the target sensing device in Figures 4-12 above.
[0225] The processor 1401 is used to perform the method steps performed by the target sensing device in Figures 4-12 above.
[0226] Optionally, the target sensing device may also include a transceiver 1403 for receiving or sending data.
[0227] This application also provides a computer-readable storage medium storing a program that, when run on a computer, causes the computer to perform the steps of the method described in the embodiments shown in Figures 4-12 above.
[0228] Optionally, the target sensing device shown in Figure 14 above is a chip.
[0229] This application embodiment also provides a target sensing device, which can also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit. The processing unit is used to perform the method steps performed by the target sensing device in any of the embodiments shown in Figures 4-12 above.
[0230] This application also provides a digital processing chip. This digital processing chip integrates circuitry for implementing the functions of the processor 1601 or processor 1401 described above, and one or more interfaces. When the digital processing chip integrates a memory, it can complete the method steps of any one or more of the foregoing embodiments. When the digital processing chip does not integrate a memory, it can be connected to an external memory via a communication interface. The digital processing chip implements the actions described above based on the program code stored in the external memory.
[0231] This application also provides a computer program product that, when run on a computer, causes the computer to perform the steps of the method described in the embodiments shown in Figures 4-12 above.
[0232] The target sensing device provided in this application embodiment can be a chip, which includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits. The processing unit can execute computer-executable instructions stored in a storage unit to cause the chip in the server to perform the target perception method described in the embodiments shown in Figures 4-12 above. Optionally, the storage unit is a storage unit within the chip, such as a register or cache. The storage unit can also be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
[0233] Specifically, the aforementioned processing unit or processor can be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0234] For example, refer to Figure 15, which is a schematic structural diagram of a chip provided in an embodiment of this application. The chip can be embodied as a neural-network processing unit (NPU) 150. The NPU 150 is mounted as a coprocessor on the host CPU, and the host CPU assigns tasks to it. The core part of the NPU is the arithmetic circuit 1503; the controller 1504 controls the arithmetic circuit 1503 to fetch matrix data from memory and perform multiplication operations.
[0235] In some implementations, the arithmetic circuit 1503 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1503 is a two-dimensional systolic array. The arithmetic circuit 1503 can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1503 is a general-purpose matrix processor.
[0236] For example, suppose we have an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 1502 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 1501 and performs matrix operations with matrix B. The partial result or the final result of the obtained matrix is stored in the accumulator 1508.
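The role of the accumulator in this matrix operation can be illustrated in software. The sketch below is only an analogy for the data flow, not a description of the hardware: it computes C = A x B in slices along the shared dimension and adds each partial result into an accumulator, and the function name and the `tile` parameter are hypothetical.

```python
def matmul_accumulate(a, b, tile=2):
    """Compute C = A x B, accumulating partial results slice by slice.

    a: input matrix A as a list of rows (rows x inner).
    b: weight matrix B as a list of rows (inner x cols).
    tile: number of shared-dimension elements processed per step (assumed).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    # Plays the role of the accumulator holding partial results.
    acc = [[0.0] * cols for _ in range(rows)]
    for k0 in range(0, inner, tile):
        k1 = min(k0 + tile, inner)
        for i in range(rows):
            for j in range(cols):
                # Partial dot product over the current slice of the shared dimension.
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return acc
```

After the last slice has been processed, the accumulator holds the final result of the matrix; before that, it holds a partial result.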
[0237] Unified memory 1506 is used to store input and output data. Weight data is directly transferred to weight memory 1502 via direct memory access controller (DMAC) 1505. Input data is also transferred to unified memory 1506 via DMAC.
[0238] The bus interface unit (BIU) 1510 is used for interaction between the AXI bus and the DMAC and instruction fetch buffer (IFB) 1509.
[0239] The bus interface unit (BIU) 1510 is used by the instruction fetch buffer 1509 to fetch instructions from external memory, and also by the direct memory access controller 1505 to fetch the original data of the input matrix A or the weight matrix B from external memory.
[0240] The DMAC is mainly used to move input data from the external memory (DDR) to unified memory 1506, to move weight data to weight memory 1502, or to move input data to input memory 1501.
[0241] The vector computation unit 1507 includes multiple arithmetic processing units that further process the output of the arithmetic circuit 1503 as needed, for example through vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparisons. It is mainly used for computation in the non-convolutional and non-fully-connected layers of neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
[0242] In some implementations, the vector computation unit 1507 can store the processed output vector in the unified memory 1506. For example, the vector computation unit 1507 can apply linear and/or nonlinear functions to the output of the arithmetic circuit 1503, such as performing linear interpolation on feature planes extracted by convolutional layers, or accumulating a vector of values to generate activation values. In some implementations, the vector computation unit 1507 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as activation input to the arithmetic circuit 1503, for example, for use in subsequent layers of the neural network.
[0243] The instruction fetch buffer 1509 connected to the controller 1504 is used to store the instructions used by the controller 1504.
[0244] Unified memory 1506, input memory 1501, weight memory 1502, and instruction fetch buffer 1509 are all on-chip memories. The external memory is private to this NPU hardware architecture.
[0245] The operations of each layer in the recurrent neural network can be performed by the arithmetic circuit 1503 or the vector computation unit 1507.
[0246] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the programs of the methods in Figures 4-12 above.
[0247] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.
[0248] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0249] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.
[0250] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).
[0251] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
Claims
1. A target perception method, characterized in that it comprises: acquiring image features, the image features including features extracted from an input image, wherein the input image includes one or more frames of images taken at the same time; iteratively obtaining dynamic target features and static target features in the input image based on the image features, wherein the objects in the input image include dynamic targets and static targets, and the moving speed of the dynamic targets is greater than the moving speed of the static targets; and obtaining information about the dynamic target and information about the static target based on the dynamic target features and the static target features; wherein each iteration comprises: obtaining location information based on the dynamic target features and static target features obtained in the previous iteration, wherein the location information includes information representing the location of the static target and information representing the location of the dynamic target; fusing the dynamic target features and static target features obtained in the previous iteration based on the location information to obtain a fusion result; and performing feature sampling from the image features based on the fusion result and the location information to obtain the dynamic target features and static target features for the current iteration.
2. The method according to claim 1, characterized in that the location information includes the position of the dynamic target in 3D space and its position in the image features, and the position of the static target in 3D space and its position in the image features; and obtaining the location information based on the dynamic target features and static target features output from the previous iteration comprises: decoding the dynamic target features output from the previous iteration to obtain the position of the dynamic target in the 3D space and the position of the dynamic target in the image features; adjusting the position of the static target in the 3D space according to the static target features of the previous iteration to obtain the position of the static target in the 3D space of the current iteration; and obtaining the position of the static target in the image features based on the position of the static target in the 3D space output by the current iteration.
3. The method according to claim 1 or 2, characterized in that the method further comprises: encoding the position information of the current iteration to obtain the updated position information of the current iteration.
4. The method according to claim 1, characterized in that fusing the dynamic target features and static target features obtained in the previous iteration based on the location information to obtain the fusion result comprises: fusing, based on an attention mechanism, the dynamic target features and static target features obtained in the previous iteration according to the location information to obtain the fusion result.
5. The method according to claim 1, characterized in that acquiring the image features comprises: obtaining the input image, which includes an image captured by a monocular camera or one or more frames captured by a multi-view camera; and extracting features from the input image using a feature extraction network to obtain the image features.
6. The method according to claim 1, characterized in that the method further comprises: performing segmentation based on the image features to obtain information about at least one object in the input image.
7. The method according to claim 6, characterized in that the information about the at least one object is used as a constraint when sampling the dynamic target features and the static target features from the image features.
8. The method according to claim 1, characterized in that obtaining information about the dynamic target and information about the static target based on the dynamic target features and the static target features comprises: obtaining the bounding box of the dynamic target based on the features of the dynamic target, and obtaining the segmentation result and height information of the static target based on the features of the static target.
9. The method according to claim 1, characterized in that the input image includes images captured by a camera during vehicle operation, and the information about the dynamic target and the information about the static target are applied to the vehicle's autonomous driving or assisted driving.
10. A target sensing device, characterized in that it comprises: a feature extraction module, used to acquire image features, the image features including features extracted from an input image, wherein the input image includes one or more frames of images taken at the same time; an acquisition module, used to iteratively acquire dynamic target features and static target features in the input image based on the image features, wherein the objects in the input image include the dynamic targets and the static targets, and the moving speed of the dynamic targets is greater than the moving speed of the static targets; and a perception module, used to obtain information about the dynamic target and information about the static target based on the dynamic target features and the static target features; wherein any iteration performed by the acquisition module comprises: acquiring location information based on the dynamic and static target features output from the previous iteration, wherein the location information includes information representing the location of the static target and information representing the location of the dynamic target; fusing the dynamic and static target features obtained from the previous iteration based on the location information to obtain a fusion result; and performing feature sampling from the image features based on the fusion result and the location information to obtain the dynamic and static target features for the current iteration.
11. The apparatus according to claim 10, characterized in that the location information includes the position of the dynamic target in 3D space and its position in the image features, and the position of the static target in 3D space and its position in the image features; and the acquisition module is specifically used to: decode the dynamic target features output from the previous iteration to obtain the position of the dynamic target in the 3D space and the position of the dynamic target in the image features; adjust the position of the static target in the 3D space according to the static target features of the previous iteration to obtain the position of the static target in the 3D space of the current iteration; and obtain the position of the static target in the image features based on the position of the static target in the 3D space output by the current iteration.
12. The apparatus according to claim 10 or 11, characterized in that the apparatus further comprises: a position encoding module, used to encode the position information of the current iteration to obtain the updated position information of the current iteration.
13. The apparatus according to claim 10, characterized in that the acquisition module is specifically used to fuse the dynamic target features and static target features obtained in the previous iteration based on the location information using an attention mechanism, so as to obtain the fusion result.
14. The apparatus according to claim 10, characterized in that the feature extraction module is specifically used to: obtain the input image, which includes an image captured by a monocular camera or one or more frames captured by a multi-view camera; and extract features from the input image using a feature extraction network to obtain the image features.
15. The apparatus according to claim 10, characterized in that the apparatus further comprises: a segmentation module, used to perform segmentation based on the image features to obtain information about at least one object in the input image.
16. The apparatus according to claim 15, characterized in that the information about the at least one object is used as a constraint when sampling the dynamic target features and the static target features from the image features.
17. The apparatus according to claim 10, characterized in that the perception module is specifically used to obtain the bounding box of the dynamic target based on the features of the dynamic target, and to obtain the segmentation result and height information of the static target based on the features of the static target.
18. The apparatus according to claim 10, characterized in that the input image includes images captured by a camera during vehicle operation, and the information about the dynamic target and the information about the static target are applied to the vehicle's autonomous driving or assisted driving.
19. A target sensing device, characterized in that it comprises a processor coupled to a memory storing a program, wherein the program instructions stored in the memory, when executed by the processor, implement the method of any one of claims 1 to 9.
20. A computer-readable storage medium comprising a program, which, when executed by a processing unit, performs the method as described in any one of claims 1 to 9.
21. A target sensing device, characterized in that it comprises a processing unit and a communication interface, wherein the processing unit obtains program instructions through the communication interface, and the program instructions, when executed by the processing unit, implement the method of any one of claims 1 to 9.
22. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 9.