Method and system for real-time three-dimensional scene graph creation
By generating 3D scene graphs for autonomous vehicles using a bird's-eye view encoder and parallel decoders, the lack of 3D information prediction in existing technologies is addressed, enabling more accurate trajectory planning and safer navigation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GM GLOBAL TECHNOLOGY OPERATIONS LLC
- Filing Date
- 2025-02-10
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for constructing scene graphs for autonomous vehicles lack 3D information prediction. As a result, trajectory planning is inaccurate in complex driving scenarios, which makes safe and efficient navigation difficult.
A bird's-eye view encoder encodes sensor data to generate a feature embedding sequence. The feature embedding sequence is decoded by feature-specific decoders executed in parallel, and a prediction network converts the decoded sequence into semantic features. A topological network then processes the semantic features and the decoded sequence to generate an adjacency matrix that represents a 3D view of the vehicle's current scene and predicts the strength of relationships between elements.
It enables autonomous vehicles to accurately understand complex 3D scenes, improving the accuracy and safety of trajectory planning, especially in environments with different altitudes, obstacles, and dynamic elements.
Figure CN122199779A_ABST
Abstract
Description
[0001] Introduction
[0002] The information provided in this section is intended to generally present the background of this disclosure. The work of the presently named inventors, to the extent it is described in this section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against this disclosure.
[0003] This disclosure generally relates to the creation of real-time 3D scene graphs. In the field of autonomous vehicle technology, scene graphs are crucial for enabling vehicles to perceive and interact with their surroundings. Currently, scene graphs are generated using a combination of sensor data from cameras, light detection and ranging (LiDAR), radar, and ultrasonic sensors. These sensors provide a representation of the vehicle's environment, which is then processed to identify objects, their positions, and their movements. The vehicle's autonomous systems use this information to make decisions regarding navigation and maneuvering. However, existing methods primarily focus on 2D data.
[0004] Despite advancements in sensor technology, current scene graph methods lack the integration of 3D information prediction. 3D prediction of scene graphs enables vehicles to plan more precise trajectories, especially in complex driving scenarios. For example, 3D scene graphs are crucial for capturing the spatial relationships and depth information needed for accurate vehicle trajectory and maneuver planning. Utilizing this critical data, autonomous vehicles can truly understand the scene for safe and efficient navigation, particularly in environments with varying altitudes, obstacles, and dynamic elements.
Summary of the Invention
[0005] One aspect of this disclosure provides a computer-implemented method that, when executed on data processing hardware, causes the hardware to perform operations including: receiving sensor data corresponding to a two-dimensional (2D) image of a current scene of a vehicle as input to a transformer model; and encoding the sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence. Here, the feature embedding sequence corresponds to a three-dimensional (3D) representation of the current scene of the vehicle. The operation further includes decoding the feature embedding sequence using two or more feature-specific decoders executed in parallel, and processing the decoded feature embedding sequence using a prediction network to convert the decoded feature embedding sequence into semantic features. The operation further includes processing the semantic features and the decoded feature embedding sequence using a topological network to generate an adjacency matrix representing a 3D view of the current scene of the vehicle.
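For illustration only, the order of operations in this aspect can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function name `run_scene_graph_pipeline`, the decoder keys, and all toy stand-ins (`toy_encoder`, `toy_decoder`, `toy_prediction`, `toy_topology`) are hypothetical, and the decoders run one after another here, whereas the disclosure describes parallel execution.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def run_scene_graph_pipeline(
    sensor_data: Sequence[Vector],
    bev_encoder: Callable[[Sequence[Vector]], List[Vector]],
    decoders: Dict[str, Callable[[List[Vector]], List[Vector]]],
    prediction_network: Callable[[List[Vector]], List[Vector]],
    topology_network: Callable[[List[Vector], List[Vector]], Matrix],
) -> Matrix:
    """Chain the stages in the order the summary lists them."""
    # 1. Encode the 2D sensor data into a feature embedding sequence.
    embeddings = bev_encoder(sensor_data)
    # 2. Decode with each feature-specific decoder (sequential in this sketch).
    decoded: List[Vector] = []
    for name in sorted(decoders):
        decoded.extend(decoders[name](embeddings))
    # 3. Convert the decoded embeddings into semantic features.
    semantic = prediction_network(decoded)
    # 4. Generate the adjacency matrix from both inputs.
    return topology_network(semantic, decoded)


# Hypothetical stand-ins so the sketch runs end to end; none is a real model.
def toy_encoder(frames: Sequence[Vector]) -> List[Vector]:
    return [[sum(f) / len(f)] * 2 for f in frames]


def toy_decoder(scale: float) -> Callable[[List[Vector]], List[Vector]]:
    return lambda emb: [[scale * x for x in e] for e in emb]


def toy_prediction(decoded: List[Vector]) -> List[Vector]:
    return [[max(0.0, x) for x in d] for d in decoded]


def toy_topology(semantic: List[Vector], decoded: List[Vector]) -> Matrix:
    n = len(decoded)
    return [
        [1.0 if i != j and semantic[i][0] * semantic[j][0] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


adjacency = run_scene_graph_pipeline(
    [[1.0, 3.0], [2.0, 4.0]],
    toy_encoder,
    {"decoder_a": toy_decoder(1.0), "decoder_b": toy_decoder(2.0)},
    toy_prediction,
    toy_topology,
)
```

Passing each stage in as a callable keeps the sketch independent of any particular network architecture; only the data flow between stages is taken from the text.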
[0006] Implementations of this disclosure may include one or more of the following optional features. In some implementations, the two or more feature-specific decoders executed in parallel each include multiple transformer layers. In some implementations, each transformer layer may include a cross-attention head. Additionally or alternatively, decoding a feature embedding sequence using the two or more feature-specific decoders executed in parallel may include performing cross-attention of the feature embedding sequence between corresponding transformer layers of the two or more feature-specific decoders.
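As a hedged illustration of cross-attention between corresponding layers of two decoders, the following Python sketch uses standard scaled dot-product attention. The disclosure does not specify the attention formula, the residual addition, or the tensor shapes, so all of these are assumptions, and the names `cross_attention` and `cross_decoder_layer` are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query row attends over all key rows."""
    scale = math.sqrt(len(keys[0]))
    out: Matrix = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def cross_decoder_layer(state_a: Matrix, state_b: Matrix) -> Tuple[Matrix, Matrix]:
    """Each decoder attends to the other decoder's state at the same layer.

    The residual addition is an assumption made for this sketch.
    """
    from_b = cross_attention(state_a, state_b, state_b)
    from_a = cross_attention(state_b, state_a, state_a)
    new_a = [[x + y for x, y in zip(r, u)] for r, u in zip(state_a, from_b)]
    new_b = [[x + y for x, y in zip(r, u)] for r, u in zip(state_b, from_a)]
    return new_a, new_b


state_a = [[1.0, 0.0]]
state_b = [[1.0, 0.0], [0.0, 1.0]]
new_a, new_b = cross_decoder_layer(state_a, state_b)
```

Both updated states are computed from the original inputs, so neither decoder sees the other's already-updated state within the same layer.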
[0007] In some examples, the sensor data comprises a set of image frames. In these examples, the operation may further include: for each image frame in the set of image frames, extracting a feature embedding of the current scene. Here, encoding the sensor data using a bird's-eye view encoder to generate a corresponding sequence of feature embeddings may include projecting the sensor data into the corresponding sequence of feature embeddings. In some implementations, the prediction network comprises a multilayer perceptron network. In some examples, the 2D image of the vehicle's current scene comprises at least two elements. In these examples, the adjacency matrix can predict the strength of the relationship between the at least two elements.
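To make the idea of an adjacency matrix of relationship strengths concrete, the following Python sketch scores every ordered pair of elements with a small multilayer perceptron and a sigmoid. This is an assumption for illustration: the disclosure does not state how the strengths are computed, the weights here are random rather than trained, and the names `mlp_score` and `adjacency_matrix` are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def mlp_score(x: Vector, w1: List[Vector], b1: Vector, w2: Vector, b2: float) -> float:
    """One hidden ReLU layer followed by a sigmoid, giving a value in (0, 1)."""
    hidden = [
        max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def adjacency_matrix(features: List[Vector], w1, b1, w2, b2) -> List[List[float]]:
    """Entry [i][j] scores the concatenated features of elements i and j."""
    n = len(features)
    return [
        [mlp_score(features[i] + features[j], w1, b1, w2, b2) for j in range(n)]
        for i in range(n)
    ]


rng = random.Random(0)  # fixed seed so the untrained toy weights are repeatable
feature_dim, hidden_dim = 2, 3
w1 = [[rng.uniform(-1, 1) for _ in range(2 * feature_dim)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
b2 = 0.0

element_features = [[0.2, 0.1], [0.9, 0.4], [0.5, 0.5]]  # three made-up elements
strengths = adjacency_matrix(element_features, w1, b1, w2, b2)
```

Each entry can be read as a soft edge weight between two elements; a downstream consumer could threshold it to obtain a discrete graph.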
[0008] Another aspect of this disclosure provides a system including data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed by the data processing hardware, cause the data processing hardware to perform operations. The operations include receiving sensor data corresponding to a two-dimensional (2D) image of a current scene of a vehicle as input to a transformer model, and encoding the sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence. Here, the feature embedding sequence corresponds to a three-dimensional (3D) representation of the current scene of the vehicle. The operations further include decoding the feature embedding sequence using two or more feature-specific decoders executed in parallel, and processing the decoded feature embedding sequence using a prediction network to convert the decoded feature embedding sequence into semantic features. The operations further include processing the semantic features and the decoded feature embedding sequence using a topological network to generate an adjacency matrix representing a 3D view of the current scene of the vehicle.
[0009] This aspect may include one or more of the following optional features. In some implementations, the two or more feature-specific decoders executed in parallel each include multiple transformer layers. In these implementations, each transformer layer may include a cross-attention head. Additionally or alternatively, decoding the feature embedding sequence using the two or more feature-specific decoders executed in parallel may include performing cross-attention of the feature embedding sequence between the corresponding transformer layers of the two or more feature-specific decoders.
[0010] In some examples, the sensor data comprises a set of image frames. In these examples, the operation may further include, for each image frame in the set, extracting a feature embedding of the current scene. Here, encoding the sensor data using a bird's-eye view encoder to generate a corresponding sequence of feature embeddings may include projecting the sensor data into the corresponding sequence of feature embeddings. In some examples, the 2D image of the vehicle's current scene comprises at least two elements. In these examples, the adjacency matrix may predict the strength of the relationship between the at least two elements.
[0011] Another aspect of this disclosure provides a computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations. The operations include receiving sensor data corresponding to a two-dimensional (2D) image of a current scene of a vehicle as input to a transformer model, the 2D image comprising at least two elements. The operations further include encoding the sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence. Here, the feature embedding sequence corresponds to a three-dimensional (3D) representation of the current scene of the vehicle. The operations further include decoding the feature embedding sequence using two or more feature-specific decoders executed in parallel, and processing the decoded feature embedding sequence using a topological network to generate an adjacency matrix representing a 3D view of the current scene of the vehicle. Here, the adjacency matrix predicts the strength of the relationship between the at least two elements.
[0012] This disclosure provides the following examples:
[0013] Example 1. A computer-implemented method executed on data processing hardware, the method causing the data processing hardware to perform operations including:
[0014] receiving sensor data corresponding to a two-dimensional (2D) image of a current scene of a vehicle as input to a transformer model;
[0015] encoding the sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence, the feature embedding sequence corresponding to a three-dimensional (3D) representation of the current scene of the vehicle;
[0016] decoding the feature embedding sequence using two or more feature-specific decoders executed in parallel;
[0017] processing the decoded feature embedding sequence using a prediction network to convert the decoded feature embedding sequence into semantic features; and
[0018] processing the semantic features and the decoded feature embedding sequence using a topological network to generate an adjacency matrix representing a 3D view of the current scene of the vehicle.
[0019] Example 2. The method according to Example 1, wherein the two or more feature-specific decoders executed in parallel each include multiple transformer layers.
[0020] Example 3. The method according to Example 2, wherein each transformer layer includes a cross-attention head.
[0021] Example 4. The method according to Example 2, wherein decoding the feature embedding sequence using the two or more feature-specific decoders executed in parallel includes performing cross-attention of the feature embedding sequence between corresponding transformer layers of the two or more feature-specific decoders.
[0022] Example 5. The method according to Example 1, wherein the sensor data includes a set of image frames.
[0023] Example 6. The method according to Example 5, wherein the operation further includes: for each image frame in the set of image frames, extracting the feature embedding of the current scene.
[0024] Example 7. The method according to Example 6, wherein encoding sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence includes projecting the sensor data into the corresponding feature embedding sequence.
[0025] Example 8. The method according to Example 1, wherein the prediction network comprises a multilayer perceptron network.
[0026] Example 9. The method described in Example 1, wherein the 2D image of the current scene of the vehicle includes at least two elements.
[0027] Example 10. The method according to Example 9, wherein the adjacency matrix predicts the strength of the relationship between the at least two elements.
[0028] Example 11. A system comprising:
[0029] Data processing hardware; and
[0030] Memory hardware that communicates with the data processing hardware, the memory hardware storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations including the following:
[0031] receiving sensor data corresponding to a two-dimensional (2D) image of a current scene of a vehicle as input to a transformer model;
[0032] encoding the sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence, the feature embedding sequence corresponding to a three-dimensional (3D) representation of the current scene of the vehicle;
[0033] decoding the feature embedding sequence using two or more feature-specific decoders executed in parallel;
[0034] processing the decoded feature embedding sequence using a prediction network to convert the decoded feature embedding sequence into semantic features; and
[0035] processing the semantic features and the decoded feature embedding sequence using a topological network to generate an adjacency matrix representing a 3D view of the current scene of the vehicle.
[0036] Example 12. The system according to Example 11, wherein the two or more feature-specific decoders that are executed in parallel each include multiple transformer layers.
[0037] Example 13. The system according to Example 12, wherein each transformer layer includes a cross-attention head.
[0038] Example 14. The system according to Example 12, wherein decoding the feature embedding sequence using the two or more feature-specific decoders executed in parallel includes performing cross-attention of the feature embedding sequence between corresponding transformer layers of the two or more feature-specific decoders.
[0039] Example 15. The system according to Example 11, wherein the sensor data includes a set of image frames.
[0040] Example 16. The system according to Example 15, wherein the operation further includes, for each of the set of image frames, extracting the feature embedding of the current scene.
[0041] Example 17. The system according to Example 16, wherein encoding sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence includes projecting the sensor data into the corresponding feature embedding sequence.
[0042] Example 18. The system according to Example 11, wherein the 2D image of the current scene of the vehicle includes at least two elements.
[0043] Example 19. The system according to Example 18, wherein the adjacency matrix predicts the strength of the relationship between the at least two elements.
[0044] Example 20. A computer-implemented method executed on data processing hardware, the method causing the data processing hardware to perform operations including:
[0045] receiving sensor data corresponding to a two-dimensional (2D) image of a current scene of a vehicle as input to a transformer model, the 2D image including at least two elements;
[0046] encoding the sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence, the feature embedding sequence corresponding to a three-dimensional (3D) representation of the current scene of the vehicle;
[0047] decoding the feature embedding sequence using two or more feature-specific decoders executed in parallel; and
[0048] processing the decoded feature embedding sequence using a topological network to generate an adjacency matrix representing a 3D view of the current scene of the vehicle, the adjacency matrix predicting the strength of the relationship between the at least two elements.
[0049] Details of one or more embodiments of the present disclosure are set forth in the accompanying drawings and the following description. Other aspects, features, and advantages will be apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
[0050] The accompanying drawings described herein are for illustrative purposes only and are not intended to limit the scope of this disclosure.
[0051] Figure 1 is a schematic view of an example system for creating real-time 3D scene graphs.
[0052] Figure 2 is a schematic view of example components of the system of Figure 1.
[0053] Figures 3A and 3B are example views of a two-dimensional scene and its corresponding three-dimensional scene.
[0054] Figure 4 is a schematic view of an example feature decoder for the transformer model of the system of Figure 1.
[0055] Figure 5 is a flowchart of an example arrangement of operations for a method of creating real-time 3D scene graphs.
[0056] Figure 6 is a flowchart of an example arrangement of operations for a method of creating real-time 3D scene graphs.
[0057] Corresponding reference numerals indicate corresponding parts throughout the accompanying figures.
Detailed Description
[0058] The example configuration will now be described more fully with reference to the accompanying drawings. The example configuration is provided so that this disclosure will be comprehensive and will fully convey the scope of this disclosure to those skilled in the art. Specific details, such as examples of specific components, devices, and methods, are set forth to provide a comprehensive understanding of the configuration of this disclosure. It will be apparent to those skilled in the art that the specific details are not required, the example configuration may be embodied in many different forms, and the specific details and example configuration should not be construed as limiting the scope of this disclosure.
[0059] The terminology used herein is for describing specific exemplary configurations only and is not intended to be limiting. As used herein, the singular articles “a,” “an,” and “the” may also be intended to include plural forms unless the context explicitly indicates otherwise. The terms “comprising,” “including,” and “having” are inclusive and therefore specify the presence of features, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, elements, components, and / or groups thereof. The method steps, processes, and operations described herein should not be construed as necessarily requiring them to be performed in the specific order discussed or illustrated, unless explicitly identified as such. Additional or alternative steps may be employed.
[0060] When an element or layer is referred to as “on another element or layer,” “joined to,” “connected to,” “attached to,” or “coupled to” another element or layer, it may be located directly on, joined to, connected to, attached to, or coupled to that other element or layer, or there may be intermediate elements or layers present. Conversely, when an element is referred to as “directly on another element or layer,” “directly joined to,” “directly connected to,” “directly attached to,” or “directly coupled to” another element or layer, there may be no intermediate elements or layers present. Other terms used to describe relationships between elements should be interpreted in a similar manner (e.g., “between” vs. “directly between,” “adjacent” vs. “directly adjacent,” etc.). As used herein, the term “and / or” includes any and all combinations of one or more of the related listed items.
[0061] The terms “first,” “second,” “third,” etc., are used herein to describe various elements, components, regions, layers, and / or sections. These elements, components, regions, layers, and / or sections should not be limited by these terms. These terms are used only to distinguish one element, component, region, layer, or section from another. Unless the context explicitly indicates otherwise, terms such as “first,” “second,” and other numerical terms do not imply order or sequence. Therefore, the first element, component, region, layer, or section discussed below may be referred to as the second element, component, region, layer, or section without departing from the teachings of the example configuration.
[0062] In this application, the term "module" is used in place of the term "circuit". The term "module" may refer to, be part of, or include the following: Application-Specific Integrated Circuit (ASIC); Digital, Analog, or Mixed-Analog / Digital Discrete Circuit; Digital, Analog, or Mixed-Analog / Digital Integrated Circuit; Combinational Logic Circuit; Field-Programmable Gate Array (FPGA); Processor (shared, dedicated, or grouped) that executes code; Memory (shared, dedicated, or grouped) that stores code executed by the processor; Other suitable hardware components that provide the aforementioned functionality; or combinations of some or all of the foregoing, such as in a system-on-a-chip.
[0063] The term "code" as used above can include software, firmware, and / or microcode, and can refer to programs, routines, functions, classes, and / or objects. The term "shared processor" covers a single processor that executes some or all of the code from multiple modules. The term "group processor" covers a processor that, in combination with additional processors, executes some or all of the code from one or more modules. The term "shared memory" covers a single memory that stores some or all of the code from multiple modules. The term "group memory" covers memory that, in combination with additional memory, stores some or all of the code from one or more modules. The term "memory" can be a subset of the term "computer-readable medium." The term "computer-readable medium" does not include transitory electrical and electromagnetic signals propagating through a medium, and can therefore be considered tangible and non-transitory. Non-limiting examples of non-transitory tangible computer-readable media include non-volatile memory, magnetic storage devices, and optical storage devices.
[0064] The apparatus and methods described in this application may be implemented, in part or in whole, by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions stored on at least one non-transitory tangible computer-readable medium. The computer programs may also include and / or depend on stored data.
[0065] A software application (i.e., a software resource) can refer to computer software that instructs a computing device to perform a task. In some examples, a software application may be referred to as an "application," "app," or "program." Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and game applications.
[0066] Non-transitory memory can be a physical device used for temporary or permanent storage of programs (e.g., instruction sequences) or data (e.g., program state information) for use by a computing device. Non-transitory memory can be volatile and / or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electronically erasable programmable read-only memory (EEPROM) (e.g., commonly used in firmware, such as bootloaders). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase-change memory (PCM), and magnetic disks or magnetic tapes.
[0067] These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented using high-level procedural and / or object-oriented programming languages and / or assembly / machine languages. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer-readable medium, apparatus, and / or device (e.g., disk, optical disk, memory, programmable logic device (PLD)) used to provide machine instructions and / or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals. The term “machine-readable signal” refers to any signal used to provide machine instructions and / or data to a programmable processor.
[0068] Various implementations of the systems and techniques described herein can be implemented in digital electronic and / or optical circuits, integrated circuits, specially designed ASICs (Application-Specific Integrated Circuits), computer hardware, firmware, software, and / or combinations thereof. These various implementations may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be dedicated or general-purpose, coupled to receive data and instructions from a storage system, at least one input device, and at least one output device, and to transmit data and instructions to the storage system, at least one input device, and at least one output device.
[0069] The processes and logic described in this specification can be executed by one or more programmable processors (also known as data processing hardware) that execute one or more computer programs to perform functions by manipulating input data and generating output. The processes and logic can also be executed by special-purpose logic circuitry, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits). Processors suitable for executing computer programs include, for example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Typically, the processor receives instructions and data from read-only memory or random access memory, or both. The basic elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include one or more mass storage devices for storing data, or operably coupled to receive data from or transfer data to, or both, such as magnetic disks, magneto-optical disks, or optical disks. However, a computer does not necessarily need to have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROMs and DVD-ROMs. Processors and memory may be supplemented by or incorporated into dedicated logic circuitry.
[0070] To provide interaction with a user, one or more aspects of this disclosure can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touchscreen) for displaying information to the user and optionally having a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual, auditory, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. Furthermore, the computer can interact with the user by sending documents to and receiving documents from the device used by the user; for example, by sending web pages to a web browser on the user's client device in response to a request received from a web browser.
[0071] Referring to Figure 1, in some implementations, system 100 includes vehicle 10, which communicates with remote system 60 via network 40. Network 40 may include a wireless local area network (WLAN) that facilitates communication and interoperability between vehicle 10 and remote system 60 within the environment of vehicle 10. Accordingly, network 40 may include wireless fidelity (e.g., IEEE 802.11), low-rate wireless personal area networks (e.g., IEEE 802.15.4), WiMAX, 3G, 4G, LTE, 5G, digital subscriber line (DSL), Bluetooth, near field communication (NFC), or any other wireless standard, or Ethernet (e.g., IEEE 802.3). System 100 may additionally include one or more access points (APs) (not shown) configured to facilitate wireless communication between vehicle 10 and remote system 60.
[0072] As shown, vehicle 10 and / or remote system 60 execute a three-dimensional (3D) scene graph system 200 (Figure 2). The system 200 is configured to receive sensor data 18 of scene 102 of vehicle 10, and to simultaneously infer one or more elements 302 (Figure 3A) in scene 102 and generate / predict a complete 3D scene graph of the scene 102 surrounding the vehicle 10 for use by downstream applications of the vehicle 10. As described in further detail below, two or more decoders 230 of the 3D scene graph system 200, executing in parallel, can pull / share feature embeddings 212 via a cross-attention mechanism. This sharing of feature embeddings 212 provides synergy between the two or more decoders 230 through parallel task execution, which not only improves the performance of the 3D scene graph system 200 relative to existing perception systems, but also allows for the real-time generation of an adjacency matrix 262 representing a 3D scene graph of the scene 102 surrounding the vehicle 10. Advantageously, the construction of the adjacency matrix 262 gives downstream applications a clearer understanding of the scene 102 for safer and more accurate maneuvering, especially in autonomous driving modes.
[0073] In the example shown, the 3D scene graph system 200 is implemented within vehicle 10. However, the 3D scene graph system 200 can be implemented in any other type of vehicle, such as, but not limited to, motorcycles, trucks, off-road vehicles, agricultural equipment, trains, airplanes, etc. Furthermore, although the 3D scene graph system 200 is shown as being implemented within vehicle 10, it can be implemented on other computing devices (e.g., computing devices communicating with vehicle 10), such as, but not limited to, smartphones, tablets, smart displays, desktop / laptop computers, smartwatches, smart appliances, or smart glasses / headsets. Vehicle 10 includes data processing hardware 12 and memory hardware 14 for storing instructions, which, when executed on the data processing hardware 12, cause the data processing hardware 12 to perform operations.
[0074] As shown in Figures 1 and 2, vehicle 10 is configured to receive sensor data 18 detected / captured by sensor system 16. Sensor system 16 may include one or more of the following: a camera, a forward collision mitigation system, radio detection and ranging (RADAR), light detection and ranging (LIDAR) capable of capturing image data, and other external sensors of vehicle 10. Although Figure 1 shows the sensor system 16 positioned at the front of the vehicle 10, it should be understood that the sensor system 16 may include sensors located throughout the vehicle 10. For example, the sensor system 16 can provide 360-degree surround sensing of the environment of the vehicle 10.
[0075] Sensor data 18 may include one or more image frames 104 of a scene 102 located outside the vehicle 10. Notably, the one or more image frames 104 of the scene 102 are two-dimensional (2D). These image frames 104 may capture one or more elements 302 within the image frame 104. As used herein, element 302 may generally refer to dynamic elements (such as pedestrians and vehicles) and static elements (such as traffic lights, road signs, road markings, etc.). Although the image frames 104 are 2D, downstream applications of the vehicle 10 may benefit from a 3D representation of the scene 102 (such as the position and orientation of each element 302 and the relationships between elements 302) to comprehensively plan vehicle maneuvers.
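One hedged way to picture the 3D representation described above (elements with a position and orientation, plus relationships between them) is the following Python data structure. The class names, field names, units, and the 0.5 threshold are all hypothetical choices for this sketch; the disclosure does not define a storage format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class ElementKind(Enum):
    DYNAMIC = "dynamic"  # e.g., pedestrians and vehicles
    STATIC = "static"    # e.g., traffic lights, road signs, road markings


@dataclass(frozen=True)
class SceneElement:
    label: str
    kind: ElementKind
    position: Tuple[float, float, float]  # x, y, z; meters assumed
    heading_rad: float                    # rotation about the vertical axis; assumed


@dataclass
class SceneGraph:
    elements: List[SceneElement]
    adjacency: List[List[float]]  # adjacency[i][j] = relationship strength

    def related(self, index: int, threshold: float = 0.5) -> List[str]:
        """Labels of elements whose relationship to `index` meets the threshold."""
        return [
            self.elements[j].label
            for j, strength in enumerate(self.adjacency[index])
            if j != index and strength >= threshold
        ]


graph = SceneGraph(
    elements=[
        SceneElement("pedestrian", ElementKind.DYNAMIC, (4.0, 1.5, 0.0), 1.57),
        SceneElement("traffic_light", ElementKind.STATIC, (12.0, 3.0, 5.0), 0.0),
        SceneElement("vehicle", ElementKind.DYNAMIC, (20.0, -1.0, 0.0), 3.14),
    ],
    adjacency=[
        [0.0, 0.8, 0.1],
        [0.8, 0.0, 0.9],
        [0.1, 0.9, 0.0],
    ],
)
near_light = graph.related(1)
```

Storing the adjacency matrix alongside the element list lets a planner look up both where an element is and which other elements it is strongly related to.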
[0076] The remote system 60 (e.g., a server, a cloud computing environment) also includes data processing hardware 62 and memory hardware 64 for storing instructions, which, when executed on the data processing hardware 62, cause the data processing hardware 62 to perform operations. In some examples, the execution of the 3D scene graph system 200 is shared between the vehicle 10 and the remote system 60. As described in more detail with reference to Figures 1-4, the 3D scene graph system 200 executing on the vehicle 10 and / or the remote system 60 executes a transformer model 202, which is configured to: receive sensor data 18 including an image frame 104 that captures one or more elements 302 represented in 2D; and generate an adjacency matrix 262 that represents a 3D view of the elements 302 in the current scene 102 of the vehicle 10.
[0077] Referring to Figure 2, a 3D scene graph system 200 executing a transformer model 202 is shown. The transformer model 202 is a deep neural network (DNN) that includes a bird's-eye view (BEV) encoder 210, multiple feature-specific decoders 230, 230a-e configured for parallel execution / processing, a prediction head 240 (also called a prediction network 240), a feature transformation module 250, and a topology network 260. Furthermore, the transformer model 202 of the 3D scene graph system 200 can access a memory buffer 220. The memory buffer 220 may include previously generated feature embeddings 212 corresponding to a previous scene 102 encountered by the vehicle 10 at a previous time step y-1, and can be stored in the memory hardware 14, 64 of Figure 1.
[0078] As shown, the 3D scene graph system 200 continuously receives / processes sensor data 18 detected by the sensor system 16 to identify one or more elements 302 in the sensor data 18. The BEV encoder 210 receives the 2D sensor data 18 as input and encodes the sensor data 18 to generate, as output, a feature embedding sequence 212 representing the elements 302 in the sensor data 18. As described above, the sensor data 18 may include a set of image frames 104. Here, for each image frame 104 of the sensor data 18, the BEV encoder 210 can extract a feature embedding for the current scene 102 of the vehicle 10. In these cases, the BEV encoder 210 can encode the sensor data 18 by projecting the sensor data 18 into the corresponding feature embedding sequence 212. In some cases, the BEV encoder 210 generates the feature embedding sequence 212 based on the feature embedding sequence 212 of the previous time step y-1.
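As a purely illustrative sketch of how an encoder might condition the current feature embeddings on those of the previous time step y-1 held in the memory buffer 220, the Python below blends the two with a fixed weight. The class `MemoryBuffer`, the function `encode_frame`, and the `blend` parameter are hypothetical names introduced here; the disclosure does not specify how the previous embeddings are combined, and a trained BEV encoder would use learned temporal attention or a similar mechanism rather than a fixed average.

```python
from collections import deque


class MemoryBuffer:
    """Hypothetical stand-in for memory buffer 220: keeps recent embeddings."""

    def __init__(self, max_steps=4):
        self._steps = deque(maxlen=max_steps)

    def store(self, embeddings):
        self._steps.append(embeddings)

    def previous(self):
        # Embeddings from time step y-1, or None at the first time step.
        return self._steps[-1] if self._steps else None


def encode_frame(frame_features, buffer, blend=0.5):
    """Blend the current frame features with the previous step's embeddings.

    Assumption: a fixed-weight average replaces the learned temporal fusion.
    """
    prev = buffer.previous()
    if prev is None:
        current = list(frame_features)
    else:
        current = [blend * p + (1.0 - blend) * f
                   for p, f in zip(prev, frame_features)]
    buffer.store(current)
    return current


buffer = MemoryBuffer()
first = encode_frame([2.0, 4.0], buffer)   # no history yet
second = encode_frame([4.0, 0.0], buffer)  # blended with `first`
```

The bounded `deque` simply discards the oldest entry once `max_steps` is reached, which is one simple way to keep the buffer from growing during continuous operation.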
[0079] Subsequently, the 3D scene graph system 200 may store the feature embedding sequence 212 in a memory buffer 220. Additionally, each feature-specific decoder 230a-230e receives the feature embedding sequence 212 as input and generates a decoded feature embedding sequence 232 as output. As shown, the feature-specific decoders 230a-230e include a 3D sign decoder 230a, a 3D vectorized map decoder 230b, a 3D actor decoder 230c (i.e., for dynamic elements 302), a 3D traffic light decoder 230d, and a 3D road marking decoder 230e. Each feature-specific decoder 230 may be specifically trained to extract, from the sensor data 18, features associated with a specific category, such as, but not limited to, signs, maps (e.g., road topography), dynamic elements 302, traffic lights, road markings, etc. It should be understood that although five (5) feature-specific decoders 230 are shown, this disclosure contemplates that the 3D scene graph system 200 may also be implemented using more or fewer decoders 230.
[0080] Referring briefly to Figure 4, each of the feature-specific decoders 230a-230e is illustrated, wherein each feature-specific decoder 230 is communicatively coupled to each of the other feature-specific decoders 230. One or more of the feature-specific decoders 230a-230e may include a multilayer perceptron (MLP) network. In some cases, each feature-specific decoder 230 may include multiple transformer layers. Here, each transformer layer of each respective feature-specific decoder 230 may include at least one cross-attention head configured to perform inter-feature information transfer of the corresponding feature embeddings 212a-212e between the other feature-specific decoders 230a-230e. For example, at each time step processed by the feature-specific decoders 230a-230e, each feature-specific decoder 230 can determine a position estimate of its feature embedding sequence 212, and when the feature embedding sequence 212a of the first feature-specific decoder 230a is within a threshold distance of the feature embedding sequence 212b of the second feature-specific decoder 230b, the feature-specific decoders 230a and 230b can perform cross-attention of the feature embedding sequences 212a and 212b between the transformer layers of the first and second feature-specific decoders 230a and 230b. Thereafter, the feature-specific decoders 230a and 230b can perform channel-wise concatenation of the feature embedding sequences 212a and 212b, and pass the concatenated feature embedding sequences 212a and 212b to a feedforward network for processing at the next time step. As can be seen from Figure 4, at each time step, the feature-specific decoders 230a-230e can perform this cross-attention mechanism in parallel with each other.
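The distance-gated cross-attention and channel-wise concatenation described above can be sketched, under simplifying assumptions, as follows. The function name `gated_cross_attention`, the toy embeddings, and the threshold value are hypothetical. The sketch omits the learned query, key, and value projections that a real transformer layer would apply, and it uses a zero context vector when no neighbor lies within the threshold, which is an assumption not stated in the disclosure.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(queries, q_pos, keys, k_pos, threshold):
    """Attend from one decoder's embeddings to another decoder's embeddings.

    Only embeddings whose estimated positions lie within `threshold` of the
    query position take part. The attended context is then concatenated onto
    the query along the channel axis.
    """
    dim = len(queries[0])
    fused = []
    for q, qp in zip(queries, q_pos):
        near = [i for i, kp in enumerate(k_pos) if math.dist(qp, kp) <= threshold]
        if near:
            # Scaled dot-product scores; keys double as values in this sketch.
            scores = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(dim)
                      for i in near]
            weights = softmax(scores)
            context = [sum(w * keys[i][c] for w, i in zip(weights, near))
                       for c in range(dim)]
        else:
            context = [0.0] * dim  # assumption: zero context when nothing is near
        fused.append(list(q) + context)  # channel-wise concatenation
    return fused


# Toy data: two sign embeddings and one traffic-light embedding with xyz positions.
sign_emb = [[1.0, 0.0], [0.0, 1.0]]
sign_pos = [(0.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
light_emb = [[0.5, 0.5]]
light_pos = [(1.0, 0.0, 0.0)]

fused = gated_cross_attention(sign_emb, sign_pos, light_emb, light_pos, threshold=5.0)
```

In this toy case the first sign lies 1 unit from the traffic light and therefore receives its embedding as context, while the second sign, 49 units away, is left with a zero context.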
[0081] Referring again to Figure 2, the prediction head 240 receives the decoded feature embedding sequences 232a-232e output from the feature-specific decoders 230a-230e as input and processes the decoded feature embedding sequences 232 to convert them into semantic features 242. For example, the semantic features 242 may include one or more of the following: category, width, height, position (e.g., xyz coordinates), and orientation of each of one or more elements 302 in scene 102. In some cases, the prediction head 240 includes an MLP network. For example, both the regressor and the classifier of the prediction head 240 may be implemented as MLP networks, wherein the regressor MLP network is configured to generate regression predictions indicating the semantic features 242 of the position and orientation of each element 302, while the classifier MLP network is configured to generate classification predictions of the types of elements 302 present in scene 102 (e.g., dynamic elements such as pedestrians and cars, and static elements such as traffic lights, road signs, and road markings). Similarly, the feature transformation module 250 receives the decoded feature embedding sequences 232a-232e output from the feature-specific decoders 230a-230e as input, and processes the decoded feature embedding sequences 232 to transform them into transformed features 252. Like the prediction head 240, the feature transformation module 250 can be implemented as an MLP network.
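A minimal sketch of a prediction head with separate regressor and classifier MLPs is given below. All names (`make_mlp`, `run_mlp`, `predict`, `CLASSES`), the layer sizes, the six-value box layout, and the random untrained weights are assumptions for illustration; the disclosure states only that the regressor and classifier may be MLP networks.

```python
import math
import random

# Hypothetical category list, loosely following the element types named above.
CLASSES = ["car", "pedestrian", "traffic_light", "road_sign", "road_marking"]


def make_mlp(sizes, rng):
    """Create random (untrained) weights; sizes = [in, hidden, ..., out]."""
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def run_mlp(layers, x):
    for idx, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x


def predict(embedding, regressor, classifier):
    """Turn one decoded embedding into semantic features."""
    box = run_mlp(regressor, embedding)  # assumed layout: x, y, z, width, height, yaw
    logits = run_mlp(classifier, embedding)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return {"box": box, "category": CLASSES[probs.index(max(probs))], "probs": probs}


rng = random.Random(0)
regressor = make_mlp([8, 16, 6], rng)
classifier = make_mlp([8, 16, len(CLASSES)], rng)
features = predict([rng.random() for _ in range(8)], regressor, classifier)
```

Because the weights are random, the predicted category and box are meaningless; the sketch shows only the data flow from one embedding to a regression output and a class distribution.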
[0082] Subsequently, the topology network 260 can receive one or more of the decoded feature embedding sequence 232, semantic features 242, and transformed features 252 as input, and generate an adjacency matrix 262 representing a 3D view of the current scene 102 as output. Here, the topology network 260 is trained to process the input feature embeddings 232, semantic features 242, and transformed features 252 to determine which elements 302 in scene 102 are semantically strongly connected. For example, the topology network 260 can predict the confidence of the relationship between each element 302 in scene 102 and other elements 302, and generate an adjacency matrix 262 representing a 3D view of the current scene 102 of vehicle 10 based on the predicted confidence of each pair of elements 302. In other words, the adjacency matrix 262 predicts the strength of the relationship between at least two elements 302 in scene 102 of vehicle 10. Here, the topology network 260 can infer topological information about the 3D view of the current scene based on the predicted confidence of the association between each pair of elements 302 in scene 102.
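To make the notion of an adjacency matrix of pairwise relationship confidences concrete, the sketch below scores every pair of elements and stores a confidence between 0 and 1 in a symmetric matrix. The function `adjacency_matrix` and the toy embeddings are hypothetical, and the dot product followed by a sigmoid is an untrained stand-in for the trained topology network. Treating the relationship as symmetric and leaving the diagonal at zero are also assumptions.

```python
import math


def adjacency_matrix(embeddings):
    """Pairwise relationship confidence from embedding similarity.

    Assumption: a sigmoid over a dot product replaces the trained scorer.
    """
    n = len(embeddings)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            confidence = 1.0 / (1.0 + math.exp(-score))  # sigmoid
            adj[i][j] = adj[j][i] = confidence
    return adj


# Toy embeddings for three elements: a car, the lane it occupies, a distant sign.
names = ["car", "lane", "sign"]
embeddings = [[1.0, 0.5], [0.8, 0.7], [-1.0, 0.2]]
adj = adjacency_matrix(embeddings)

# Pairs whose confidence exceeds 0.5 are treated as strongly connected here.
strong_pairs = [(names[i], names[j])
                for i in range(len(names))
                for j in range(i + 1, len(names))
                if adj[i][j] > 0.5]
```

A downstream planner could read a row of `adj` to find which elements are most strongly related to a given element, for example which lane a car occupies.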
[0083] Referring to Figures 3A and 3B, an example view of the 2D scene 102 and the adjacency matrix 262 corresponding to a 3D view of the scene 102 are shown. With particular reference to Figure 3A, an image frame 104 of the current scene 102 of vehicle 10 is shown in 2D form. In image frame 104, multiple elements 302a-302o are shown. Specifically, the 3D scene graph system 200 can identify that scene 102 includes dynamic elements 302a, 302g, 302h, 302i, and 302j corresponding to cars; static elements 302b and 302f corresponding to traffic lights; static elements 302c and 302l corresponding to road signs; static elements 302d, 302e, and 302o corresponding to lanes; and static elements 302k, 302m, and 302n corresponding to road markings. After processing image frame 104 of the current scene 102, the 3D scene graph system 200 generates the adjacency matrix 262 shown in Figure 3B as output. Here, the adjacency matrix 262 includes each of the same elements 302a-302o, but each of the elements 302a-302o is represented in a bird's-eye view 3D perspective relative to the vehicle 10. In addition to the strength of the relationship between each pair of elements 302, the adjacency matrix 262 may also include the relative size, position, and orientation of each of the elements 302a-302o relative to the vehicle 10 and relative to the other elements 302a-302o. It is worth noting that the adjacency matrix 262 can be transmitted to downstream applications of the vehicle 10 (e.g., steering control, braking control, etc.) to plan the maneuvering of the vehicle 10 as it travels along a road.
[0084] Figure 5 includes a flowchart of an example arrangement of operations for a method 500 for real-time 3D scene graph creation. Method 500 may be described with reference to Figures 1-4. Data processing hardware (e.g., data processing hardware 12, 62 of Figure 1) may execute instructions stored on memory hardware (e.g., memory hardware 14, 64 of Figure 1) to perform the example arrangement of operations for method 500.
[0085] At operation 502, method 500 includes receiving sensor data 18 corresponding to a two-dimensional (2D) image of the current scene 102 of vehicle 10 as input to transformer model 202. At operation 504, method 500 includes encoding the sensor data 18 using a bird's-eye view encoder 210 to generate a corresponding feature embedding sequence 212. Here, the corresponding feature embedding sequence 212 corresponds to a three-dimensional (3D) representation of the current scene 102 of vehicle 10.
[0086] At operation 506, method 500 further includes using two or more feature-specific decoders 230, 230a-e executed in parallel to decode the feature embedding sequence 212. At operation 508, method 500 also includes using a prediction network 240 to process the decoded feature embedding sequence 232 into semantic features 242. At operation 510, method 500 further includes using a topology network 260 to process the semantic features 242 and the decoded feature embedding sequence 232 to generate an adjacency matrix 262 representing a 3D view of the current scene 102 of vehicle 10.
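The order of operations 502 to 510 can be sketched as a small pipeline in which the decoders run concurrently. The function `run_method_500` and every stand-in component below are hypothetical placeholders chosen so that the example runs; none of them reflects the trained networks of the disclosure. A thread pool is used only to illustrate parallel dispatch, and a real system would more likely batch the decoders on an accelerator.

```python
from concurrent.futures import ThreadPoolExecutor


def run_method_500(sensor_data, encoder, decoders, prediction_network, topology_network):
    """Chain the operations of method 500 with caller-supplied components."""
    # Operation 502: sensor_data is received as the input to the model.
    embeddings = encoder(sensor_data)  # operation 504
    # Operation 506: run every feature-specific decoder in parallel.
    with ThreadPoolExecutor(max_workers=len(decoders)) as pool:
        futures = {name: pool.submit(decode, embeddings)
                   for name, decode in decoders.items()}
        decoded = {name: future.result() for name, future in futures.items()}
    semantic = prediction_network(decoded)      # operation 508
    return topology_network(semantic, decoded)  # operation 510


# Toy stand-ins (assumptions, not the disclosed networks).
def toy_encoder(frames):
    return [sum(frame) / len(frame) for frame in frames]


toy_decoders = {
    "signs": lambda emb: [v * 2 for v in emb],
    "actors": lambda emb: [v + 1 for v in emb],
}


def toy_prediction_network(decoded):
    return {name: max(values) for name, values in decoded.items()}


def toy_topology_network(semantic, decoded):
    # Connect every pair of distinct categories with confidence 1.0.
    return [[0.0 if a == b else 1.0 for b in semantic] for a in semantic]


result = run_method_500([[1, 2, 3], [4, 5, 6]], toy_encoder, toy_decoders,
                        toy_prediction_network, toy_topology_network)
```

Passing the components in as callables keeps the control flow separate from the models, so the same skeleton would also cover a variant in which the topology network consumes the decoded embeddings directly.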
[0087] Figure 6 includes a flowchart of an example arrangement of operations for a method 600 for real-time 3D scene graph creation. Method 600 may be described with reference to Figures 1-4. Data processing hardware (e.g., data processing hardware 12, 62 of Figure 1) may execute instructions stored on memory hardware (e.g., memory hardware 14, 64 of Figure 1) to perform the example arrangement of operations for method 600.
[0088] Method 600 includes receiving, at operation 602, sensor data 18 corresponding to a two-dimensional (2D) image of the current scene 102 of the vehicle 10 as input to a transformer model 202. Here, the 2D image includes at least two elements 302. At operation 604, method 600 includes encoding the sensor data 18 using a bird's-eye view encoder 210 to generate a corresponding feature embedding sequence 212. Here, the corresponding feature embedding sequence 212 corresponds to a three-dimensional (3D) representation of the current scene 102 of the vehicle 10.
[0089] At operation 606, method 600 further includes decoding the feature embedding sequence 212 using two or more feature-specific decoders 230, 230a-e executed in parallel. At operation 608, method 600 includes processing the decoded feature embedding sequence 232 using a topology network 260 to generate an adjacency matrix 262 representing a 3D view of the current scene 102 of vehicle 10. Here, the adjacency matrix 262 predicts the strength of the relationship between at least two elements 302.
[0090] Several embodiments have been described. However, it should be understood that various modifications can be made without departing from the spirit and scope of this disclosure. Therefore, other embodiments are within the scope of the following claims.
[0091] The foregoing description is provided for illustrative purposes only. It is not intended to be exhaustive or limiting of this disclosure. Individual elements or features of a particular configuration are generally not limited to that particular configuration, but are (where applicable) interchangeable and can be used in selected configurations, even if they are not specifically shown or described. These elements or features can also be varied in many ways. Such variations should not be considered a departure from this disclosure, and all such modifications are intended to be included within the scope of this disclosure.
Claims
1. A computer-implemented method executed on data processing hardware, the method causing the data processing hardware to perform operations including: receiving sensor data corresponding to a two-dimensional (2D) image of the current scene of the vehicle as input to a transformer model; encoding the sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence, the corresponding feature embedding sequence corresponding to a three-dimensional (3D) representation of the current scene of the vehicle; decoding the feature embedding sequence using two or more feature-specific decoders executed in parallel; processing the decoded feature embedding sequence using a prediction network to convert the decoded feature embedding sequence into semantic features; and processing the semantic features and the decoded feature embedding sequence using a topology network to generate an adjacency matrix representing a 3D view of the current scene of the vehicle.
2. The method of claim 1, wherein each of the two or more feature-specific decoders executed in parallel comprises a plurality of transformer layers.
3. The method of claim 2, wherein each transformer layer includes a cross-attention head.
4. The method of claim 2, wherein decoding the feature embedding sequence using the two or more feature-specific decoders executed in parallel comprises performing cross-attention of the feature embedding sequence between corresponding transformer layers of the two or more feature-specific decoders.
5. The method of claim 1, wherein the sensor data comprises a set of image frames.
6. The method of claim 5, wherein the operations further comprise: for each image frame in the set of image frames, extracting a feature embedding of the current scene.
7. The method of claim 6, wherein encoding the sensor data using a bird's-eye view encoder to generate a corresponding feature embedding sequence comprises projecting the sensor data into the corresponding feature embedding sequence.
8. The method of claim 1, wherein the prediction network comprises a multilayer perceptron network.
9. The method of claim 1, wherein the 2D image of the current scene of the vehicle comprises at least two elements.
10. The method of claim 9, wherein the adjacency matrix predicts the strength of the relationship between the at least two elements.