Method, apparatus, device, storage medium and program product for generating trajectories

By combining visual language models and trajectory planning models to generate target trajectories for autonomous vehicles, the problems of high computational complexity and insufficient real-time accuracy of trajectories in existing technologies are solved, achieving more efficient and safer trajectory planning and improving the driving performance of autonomous vehicles.

CN122306101A · Pending · Publication Date: 2026-06-30 · BEIJING VOYAGER TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: BEIJING VOYAGER TECH CO LTD
Filing Date: 2024-12-30
Publication Date: 2026-06-30


Abstract

According to embodiments of this disclosure, a method, apparatus, device, storage medium, and program product for generating trajectories are provided. The method includes: determining visual features of a set of perceived images captured by an autonomous vehicle; acquiring a reference trajectory generated by a visual language model based on the visual features; and determining a target trajectory of the autonomous vehicle by a trajectory planning model based on the visual features and the reference trajectory, wherein the trajectory planning model is an end-to-end model. Based on this approach, embodiments of this disclosure can improve the quality of the generated trajectory.

Description

Technical Field

[0001] The exemplary embodiments disclosed herein generally relate to the field of computers, and particularly to methods, apparatus, devices, computer-readable storage media, and computer program products for generating trajectories.

Background Technology

[0002] With the rapid development of autonomous driving technology, autonomous vehicles are able to plan their routes in various traffic environments. Autonomous vehicles can use perception technology to identify their surroundings and use machine learning algorithms to optimize paths and make decisions.

Summary of the Invention

[0003] In a first aspect of this disclosure, a method for generating a trajectory is provided. The method includes: determining visual features of a set of perceived images captured by an autonomous vehicle; acquiring a reference trajectory generated by a visual language model based on the visual features; and determining a target trajectory of the autonomous vehicle by a trajectory planning model based on the visual features and the reference trajectory, wherein the trajectory planning model is an end-to-end model.

[0004] In a second aspect of this disclosure, an apparatus for generating a trajectory is provided. The apparatus includes: a first determining module configured to determine visual features of a set of perceived images captured by an autonomous vehicle; an acquiring module configured to acquire a reference trajectory generated by a visual language model based on the visual features; and a second determining module configured to determine a target trajectory of the autonomous vehicle by a trajectory planning model based on the visual features and the reference trajectory, wherein the trajectory planning model is an end-to-end model.

[0005] In a third aspect of this disclosure, a computing device is provided. The device includes at least one processing unit and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. When executed by the at least one processing unit, the instructions cause the device to perform the method of the first aspect.

[0006] In a fourth aspect of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that can be executed by a processor to implement the method of the first aspect.

[0007] In a fifth aspect of this disclosure, a computer program product is provided. The computer program product includes computer-executable instructions that, when executed by a processor, implement the method of the first aspect.

[0008] It should be understood that the content described in this summary section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0009] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:

[0010] Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;

[0011] Figure 2 shows a flowchart of an example process for generating a trajectory according to some embodiments of the present disclosure;

[0012] Figure 3 shows a flowchart of an example process for generating a trajectory according to some embodiments of the present disclosure;

[0013] Figure 4 shows a schematic structural block diagram of an example apparatus for generating trajectories according to certain embodiments of the present disclosure; and

[0014] Figure 5 shows a block diagram of a device capable of implementing several embodiments of the present disclosure.

Detailed Description

[0015] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0016] It should be noted that the headings of any section / subsection provided herein are not limiting. Various embodiments are described throughout this document, and embodiments of any type may be included under any section / subsection. Furthermore, embodiments described in any section / subsection may be combined in any way with any other embodiments described in the same section / subsection and / or different sections / subsections.

[0017] In the description of embodiments of this disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

[0018] The embodiments of this disclosure may involve user data, data acquisition, and / or use. All of these aspects comply with applicable laws, regulations, and relevant provisions. In the embodiments of this disclosure, all data collection, acquisition, processing, manipulation, forwarding, and use are conducted with the user's knowledge and confirmation. Accordingly, in implementing the embodiments of this disclosure, the type, scope of use, and usage scenarios of any data or information that may be involved should be communicated to the user and their authorization obtained in accordance with relevant laws and regulations through appropriate means. The specific methods of notification and / or authorization may vary depending on the actual situation and application scenario, and the scope of this disclosure is not limited in this respect.

[0019] In this specification and the embodiments, any processing of personal information will be carried out only under the premise of legality (such as obtaining the consent of the personal information subject, or being necessary for the performance of a contract), and will only be carried out within the scope stipulated or agreed upon. A user's refusal to process personal information other than that necessary for basic functions will not affect the user's use of basic functions.

[0020] There are currently two main methods for planning driving trajectories for autonomous vehicles. The first is trajectory planning implemented using end-to-end systems. The second is trajectory planning implemented using visual language models. The former relies on mapping relationships that deep neural networks learn from data. The latter relies on visual language models with a larger parameter scale and more complex computational mechanisms. Optimization that builds on both methods can enable autonomous vehicles to obtain higher-quality trajectories.

[0021] Embodiments of this disclosure propose a scheme for generating trajectories. The scheme includes: determining visual features of a set of perceived images captured by an autonomous vehicle; acquiring a reference trajectory generated by a visual language model based on the visual features; and determining a target trajectory for the autonomous vehicle by a trajectory planning model based on the visual features and the reference trajectory, wherein the trajectory planning model is an end-to-end model.
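As a non-limiting illustration, the three steps of the scheme can be sketched as a single function that passes one shared set of visual features to both models. All names below (`generate_target_trajectory`, `extract_features`, `vlm_reference`, `plan`) are hypothetical and do not appear in this disclosure; the three callables are toy stand-ins for the feature extractor, the visual language model, and the end-to-end trajectory planning model.

```python
from typing import Callable, List, Sequence, Tuple

Waypoint = Tuple[float, float]   # (x, y) position; an assumed representation
Trajectory = List[Waypoint]
Features = List[List[float]]     # one feature vector per query; also assumed


def generate_target_trajectory(
    images: Sequence[object],
    extract_features: Callable[[Sequence[object]], Features],
    vlm_reference: Callable[[Features], Trajectory],
    plan: Callable[[Features, Trajectory], Trajectory],
) -> Trajectory:
    """Run the three steps of the scheme on one shared feature set."""
    features = extract_features(images)   # step 1: visual features
    reference = vlm_reference(features)   # step 2: reference trajectory from the VLM
    return plan(features, reference)      # step 3: target trajectory from the planner


# Toy stand-ins, purely to show the data flow.
demo = generate_target_trajectory(
    images=["front.png", "rear.png"],
    extract_features=lambda imgs: [[float(len(imgs))]],
    vlm_reference=lambda feats: [(0.0, 0.0), (1.0, 0.0)],
    plan=lambda feats, ref: [(x, y + 0.1 * feats[0][0]) for x, y in ref],
)
```

The point of the sketch is that `features` is computed once and reused by both downstream models, which matches the shared visual features described above.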

[0022] In this way, embodiments of the present disclosure generate target trajectories based on shared visual features through visual language models and trajectory planning models, which can improve the efficiency of visual language models and trajectory planning models in processing visual features, thereby improving the quality of the generated trajectories and enabling autonomous vehicles to drive more safely.

[0023] The following section provides a detailed description of various example implementations of this scheme, with reference to the accompanying drawings.

[0024] Example Environment

[0025] Figure 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. As shown in Figure 1, example environment 100 may include electronic device 110. Electronic device 110 may be deployed on autonomous vehicle 120.

[0026] In some embodiments, electronic device 110 communicates with server 130 to provide services for generating trajectories. Electronic device 110 can be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, handheld computers, portable gaming terminals, VR / AR devices, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDAs), audio / video players, digital cameras / camcorders, positioning devices, television receivers, radio receivers, e-book devices, gaming devices, or any combination thereof, including accessories and peripherals of these devices or any combination thereof. In some embodiments, electronic device 110 can also support any type of user-facing interface (such as "wearable" circuitry).

[0027] Server 130 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms. Server 130 may include, for example, computing systems / servers such as mainframes, edge computing nodes, computing devices in a cloud environment, etc. Server 130 can provide backend services for electronic device 110.

[0028] A communication connection can be established between server 130 and electronic device 110. This communication connection can be established via wired or wireless means. The communication connection may include, but is not limited to, Bluetooth, mobile network, Universal Serial Bus (USB), and Wireless Fidelity (WiFi) connections; the embodiments of this disclosure are not limited in this respect. In the embodiments of this disclosure, server 130 and electronic device 110 can achieve signaling interaction through the communication connection between them.

[0029] It should be understood that the structure and function of the various elements in environment 100 are described for illustrative purposes only and do not imply any limitation on the scope of this disclosure.

[0030] The following description will continue with reference to the accompanying drawings, which will provide some exemplary embodiments of this disclosure.

[0031] Example process

[0032] The specific process of generating a trajectory is described below with reference to Figure 2 and Figure 3. Figure 2 shows a flowchart of an example process 200 for generating a trajectory according to some embodiments of the present disclosure. Figure 3 shows a flowchart of an example process 300 for generating a trajectory according to some embodiments of the present disclosure. Process 300 can be implemented at electronic device 110. Process 300 is described below with reference to Figure 1 and Figure 2.

[0033] As shown in Figure 3, in box 310, electronic device 110 determines visual features 235 of a set of perceived images 210 captured by autonomous vehicle 120.

[0034] In some embodiments, a set of perceived images 210 may be acquired by surround-view cameras deployed on the autonomous vehicle 120. Multiple surround-view cameras may be configured. Alternatively, the set of perceived images 210 may also be images acquired by multiple surround-view cameras within a predetermined time period. The predetermined time period may be, for example, one minute. The length of the predetermined time period can be adaptively adjusted according to actual circumstances.

[0035] In some embodiments, the electronic device 110 may use a model capable of processing image understanding tasks to identify a set of perceived images 210, thereby determining visual features 235 of the set of perceived images 210. The visual features 235 may indicate key environmental elements in the perceived images 210. Key environmental elements may include, for example, foreground object information such as vehicles or pedestrians, background map information such as lane lines, and environmental information such as traffic lights or weather conditions.

[0036] As an example, electronic device 110 may determine visual feature 235 based on the following process:

[0037] First, the electronic device 110 encodes a set of perceptual images 210 to determine the encoding features 225 of the set of perceptual images 210. The encoding features 225 are features obtained by the electronic device 110 after dimensionality reduction of the set of perceptual images 210, and can indicate the image information in the set of perceptual images 210. The process of encoding the set of perceptual images 210 can be implemented using models such as convolutional neural networks or visual encoders 220.

[0038] Subsequently, the electronic device 110 updates a set of collector query vectors based on the encoded features 225 to determine visual features 235. The collector query vectors indicate information of interest within a set of perceived images 210. By interacting with the encoded features 225, the electronic device 110 obtains the visual features 235 corresponding to the collector query vectors. The visual features 235 determined in this manner represent a further understanding of the perceived images 210, facilitating trajectory generation.

[0039] Specifically, the electronic device 110 may first provide the first converter network with the coding features 225 and a set of initial collector query vectors to determine a set of intermediate collector query vectors.

[0040] Before the above steps, the electronic device 110 needs to initialize a set of collector query vectors. It should be understood that the image information contained in the perceived image 210 varies depending on the scene; therefore, the collector query vectors corresponding to the perceived image 210 in different scenes are also different. By initializing the collector query vectors and then updating them using the perceived image 210, the electronic device 110 can obtain more targeted visual features 235 based on the collector query vectors.

[0041] After providing the encoded features 225 and a set of initial collector query vectors to the first converter network, the electronic device 110 can obtain a set of intermediate collector query vectors. The set of intermediate collector query vectors can, for example, indicate features representing environmental information in the perceived image 210.

[0042] Additionally, cross-attention and self-attention mechanisms can be added to the first converter network, enabling the electronic device 110 to extract features representing environmental information from the perceived image 210 more accurately.
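As a simplified illustration of how a set of query vectors may be updated against the encoded features, the following sketch implements scaled dot-product cross-attention with a residual connection. It is an assumption-laden toy: it omits the learned projection matrices, multiple heads, and normalization that a real network would use, and the names `softmax` and `cross_attention` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], features: List[Vector]) -> List[Vector]:
    """Update each query with an attention-weighted mix of the encoded features."""
    dim = len(features[0])
    scale = math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in features]
        weights = softmax(scores)
        mixed = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        # Residual connection: keep the original query and add what was gathered.
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


encoded = [[1.0, 0.0], [0.0, 1.0]]     # two encoded feature vectors
initial_queries = [[0.0, 0.0]]         # one zero-initialized collector query
intermediate = cross_attention(initial_queries, encoded)
```

Calling `cross_attention(queries, queries)` gives the corresponding self-attention step, in which the query vectors exchange information with each other instead of with the encoded features.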

[0043] Subsequently, electronic device 110 provides the second converter network with the set of intermediate collector query vectors, a set of task query vectors, and the encoded features 225 to determine visual features 235. The set of task query vectors indicates specific information of interest within the set of perceived images 210. By providing the second converter network with both the set of intermediate collector query vectors and the set of task query vectors, electronic device 110 is able to extract more targeted features from the perceived images 210.

[0044] For example, the set of task query vectors may include multiple subsets, each corresponding to a different task. In that case, the electronic device 110 can provide the different subsets of task query vectors to the second converter network sequentially. This step is further illustrated below using an example in which the set of task query vectors includes a set of object detection query vectors and a set of map understanding query vectors. Specifically, the set of object detection query vectors indicates foreground object information in the set of perceived images 210, and the set of map understanding query vectors indicates background map information in the set of perceived images 210.

[0045] Electronic device 110 first provides the second converter network with the set of intermediate collector query vectors, the set of object detection query vectors, and the encoded features 225. At this point, electronic device 110 can obtain an updated set of intermediate collector query vectors. Then, electronic device 110 provides the second converter network with the updated set of intermediate collector query vectors, the set of map understanding query vectors, and the encoded features 225, at which point electronic device 110 can obtain visual features 235. The obtained visual features 235 may include a set of features from the perceived image 210 corresponding to foreground object information, background map information, and environmental information.
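The sequential use of task query vectors can be sketched as a loop over the task subsets. How the collector queries, task queries, and encoded features interact inside the second network is not specified here, so the ordering below (task queries read the features first, then the collector queries read the updated task queries together with the features) is only one plausible assumption. The helper names `attend` and `second_stage` are hypothetical, and the attention again omits learned projections.

```python
import math
from typing import List

Vector = List[float]


def attend(queries: List[Vector], keys: List[Vector]) -> List[Vector]:
    """Residual scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        mixed = [sum(e / total * k[i] for e, k in zip(exps, keys)) for i in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def second_stage(
    collector: List[Vector],
    task_query_sets: List[List[Vector]],
    features: List[Vector],
) -> List[Vector]:
    """Feed one task subset at a time, updating the collector queries each round."""
    for task_queries in task_query_sets:
        task_queries = attend(task_queries, features)            # task queries read the features
        collector = attend(collector, task_queries + features)   # collector absorbs task info
    return collector


features = [[1.0, 0.0], [0.0, 1.0]]
collector = [[0.0, 0.0], [0.5, 0.5]]       # intermediate collector queries
detection_queries = [[1.0, 0.0]]           # object detection subset
map_queries = [[0.0, 1.0]]                 # map understanding subset
visual_features = second_stage(collector, [detection_queries, map_queries], features)
```

The number of collector queries is unchanged by the loop, so the output has the same shape regardless of how many task subsets are supplied.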

[0046] Additionally, a self-attention mechanism can be added to the second converter network, enabling the electronic device 110 to extract features representing foreground object information and features representing map background information more accurately from the perceived image 210.

[0047] Additionally, a temporal query mechanism can be added to the second converter network to integrate temporal information of features representing foreground object information and features representing map background information, respectively. By adding the temporal query mechanism, the temporal consistency and stability of the visual features 235 obtained by the electronic device 110 can be improved.

[0048] In some embodiments, the first converter network and the second converter network (that is, transformer networks) may be implemented by a query transformer 230 (Q-Former).

[0049] In box 320, electronic device 110 acquires reference trajectory 255 generated by visual language model 250 based on visual features 235.

[0050] In some embodiments, the reference trajectory 255 may be generated based on the following process:

[0051] First, the electronic device 110 sends first guidance information to the visual language model 250, instructing the visual language model 250 to determine an intermediate trajectory from a set of candidate trajectories 240. The set of candidate trajectories 240 may include real-time trajectories and trajectories related to historical trajectory data, thereby improving the real-time performance and accuracy of the generated reference trajectory 255. Herein, a trajectory related to historical trajectory data is referred to as a first candidate trajectory, and a real-time trajectory is referred to as a second candidate trajectory.

[0052] Specifically, as an example, the first candidate trajectory can be generated based on the following process: the electronic device 110 generates the first candidate trajectory by clustering training trajectory data. Multiple first candidate trajectories generated in this way can indicate the direction of change of different types of trajectories in historical trajectories. The clustering method can be, for example, the k-means algorithm.
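As an illustrative sketch of this clustering step, the following pure-Python k-means treats each training trajectory as a flat vector of coordinates and returns the cluster centers as first candidate trajectories. It assumes that all training trajectories have the same number of points, and it seeds the centers with the first k trajectories to keep the result deterministic; both choices, and the name `kmeans_trajectories`, are assumptions made for the example.

```python
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]


def kmeans_trajectories(trajs: List[Trajectory], k: int, iters: int = 20) -> List[Trajectory]:
    """Cluster equal-length trajectories and return the k cluster centers."""
    points = [[c for p in t for c in p] for t in trajs]   # flatten to [x0, y0, x1, y1, ...]
    centers = [list(p) for p in points[:k]]               # deterministic seeding (assumption)
    for _ in range(iters):
        groups: List[List[List[float]]] = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return [list(zip(c[0::2], c[1::2])) for c in centers]


training = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],   # straight
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],   # veering
    [(0.0, 0.0), (1.0, 0.2), (2.0, 0.4)],   # nearly straight
    [(0.0, 0.0), (1.0, 0.8), (2.0, 1.6)],   # veering
]
first_candidates = kmeans_trajectories(training, k=2)
```

In this toy data set, one center averages the two near-straight trajectories and the other averages the two veering ones, so each center stands for one type of historical trajectory.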

[0053] As an example, the second candidate trajectory can be generated based on the following process: electronic device 110 generates the second candidate trajectory based on the current state and historical state of autonomous vehicle 120. The current state and historical state of autonomous vehicle 120 can be, for example, the velocity and acceleration of autonomous vehicle 120. Electronic device 110 can, for example, utilize a multi-layer perceptron (MLP) to generate the second candidate trajectory 240 based on the current state and historical state of autonomous vehicle 120.
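The data flow of this step can be sketched with a two-layer perceptron that maps the vehicle's current and historical states to a short sequence of positions. The weights below are random and untrained, so the output is not meaningful as a trajectory; the sketch only shows the shape of the input and output. The layer sizes, the (velocity, acceleration) state layout, and the names `make_layer`, `linear`, and `mlp_candidate` are assumptions.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]   # (weights, bias)


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create a randomly initialized (untrained) fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: List[float], layer: Layer) -> List[float]:
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_candidate(
    states: List[Tuple[float, float]], horizon: int, seed: int = 0
) -> List[Tuple[float, float]]:
    """Map (velocity, acceleration) states to `horizon` (x, y) points."""
    rng = random.Random(seed)
    x = [v for state in states for v in state]          # flatten history + current state
    hidden_layer = make_layer(len(x), 16, rng)
    output_layer = make_layer(16, 2 * horizon, rng)
    hidden = [max(0.0, v) for v in linear(x, hidden_layer)]   # ReLU activation
    y = linear(hidden, output_layer)
    return list(zip(y[0::2], y[1::2]))


# Two historical states followed by the current state.
second_candidate = mlp_candidate([(9.5, 0.2), (9.8, 0.3), (10.0, 0.1)], horizon=6)
```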

[0054] The electronic device 110 inputs the set of candidate trajectories 240, consisting of the multiple first candidate trajectories and the second candidate trajectory, into the visual language model 250, so that the visual language model 250 can select an intermediate trajectory from the set. This can improve the real-time performance of the generated trajectory and, because the multiple first candidate trajectories capture different directions of change, improve its accuracy, thereby improving the safety of the generated trajectory.

[0055] Additionally, before sending the first guidance information to the visual language model 250, the electronic device 110 can also process the visual features 235, the set of candidate trajectories 240, and the first guidance information to facilitate recognition by the visual language model 250. For example, the electronic device 110 can convert the visual features 235 into visual tokens that can be recognized by the visual language model 250, convert the set of candidate trajectories 240 into trajectory tokens that can be recognized by the visual language model 250, and convert the first guidance information into text tokens that can be recognized by the visual language model 250. Specifically, this conversion can be implemented using a visual converter, a trajectory converter, and a trajectory chain of thought.

[0056] As an example, the first guidance information may be a prompt template based on the set of candidate trajectories 240. For instance, the first guidance information may be: "Here are the candidate trajectories 240: <traj1>, <traj2>, ..., <trajk+1>. Please select the best trajectory for autonomous vehicle 120 based on the current scenario." Here, k represents the number of first candidate trajectories, and k+1 represents the number of trajectories in the set of candidate trajectories 240. The current scenario that the visual language model 250 uses to select the best trajectory from the set of candidate trajectories 240 is the scenario represented by the visual features 235.
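A prompt of this kind can be assembled from the number of candidates. The sketch below only builds the text with one placeholder per candidate; replacing each placeholder with the corresponding trajectory tokens is left out, and the function name `build_selection_prompt` and the exact wording are assumptions based on the example above.

```python
from typing import List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]


def build_selection_prompt(candidates: Sequence[Trajectory]) -> str:
    """Build the first guidance prompt with one placeholder per candidate."""
    placeholders = ", ".join(f"<traj{i}>" for i in range(1, len(candidates) + 1))
    return (
        f"Here are the candidate trajectories: {placeholders}. "
        "Please select the best trajectory for the autonomous vehicle "
        "based on the current scenario."
    )


# k = 2 first candidate trajectories plus one second candidate trajectory.
candidates = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (1.0, 0.5)],
]
prompt = build_selection_prompt(candidates)
```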

[0057] Furthermore, when selecting the best trajectory from a set of candidate trajectories 240 as an intermediate trajectory, the visual language model 250 can evaluate the safety and adaptability of each candidate trajectory 240 to the current scene, and then select the candidate trajectory 240 with the highest score as the intermediate trajectory.
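The selection rule itself (score every candidate, keep the highest) can be sketched as follows. In the scheme described here the scoring is performed by the visual language model 250; the `clearance` function below is a hypothetical stand-in that scores a trajectory by its minimum distance to a single assumed obstacle, and summing the individual scores is likewise an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]


def select_intermediate(
    candidates: Sequence[Trajectory],
    scorers: Sequence[Callable[[Trajectory], float]],
) -> Trajectory:
    """Return the candidate whose summed score is highest."""
    def total(traj: Trajectory) -> float:
        return sum(score(traj) for score in scorers)
    return max(candidates, key=total)


def clearance(traj: Trajectory, obstacle: Tuple[float, float] = (2.0, 0.0)) -> float:
    """Toy safety score: minimum distance from the trajectory to one obstacle."""
    return min(math.dist(point, obstacle) for point in traj)


candidates = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],   # ends on the obstacle
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],   # steers away from it
]
best = select_intermediate(candidates, [clearance])
```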

[0058] Then, the electronic device 110 sends a second guidance message to the visual language model 250 to instruct the visual language model 250 to generate a reference trajectory 255 based on the intermediate trajectory.

[0059] In some embodiments, the electronic device 110 can decompose an intermediate trajectory into a set of waypoints, and then use a visual language model 250 to predict and generate a reference trajectory 255 based on the set of waypoints. In this way, the intermediate trajectory can be optimized, resulting in a reference trajectory 255 with higher accuracy and safety. Each waypoint in the set corresponds to a position coordinate, which represents the position of the autonomous vehicle 120 at a specific point in time. The set of waypoints can, for example, be arranged in a waypoint sequence.
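One way to picture the decomposition is to resample the intermediate trajectory at a fixed time step, so that each waypoint is the position at a specific point in time. The sketch below assumes that the trajectory is given as time-stamped samples `(t, x, y)` and uses linear interpolation between them; the representation, the step size, and the name `to_waypoints` are assumptions made for illustration.

```python
from typing import List, Tuple

TimedPoint = Tuple[float, float, float]   # (t, x, y)


def to_waypoints(traj: List[TimedPoint], step: float) -> List[Tuple[float, float]]:
    """Sample a time-stamped trajectory every `step` seconds by linear interpolation."""
    start, end = traj[0][0], traj[-1][0]
    count = int(round((end - start) / step))
    waypoints = []
    i = 0
    for n in range(count + 1):
        t = start + n * step
        # Advance to the segment that contains time t.
        while i + 1 < len(traj) - 1 and traj[i + 1][0] < t:
            i += 1
        (t0, x0, y0), (t1, x1, y1) = traj[i], traj[i + 1]
        ratio = (t - t0) / (t1 - t0)
        waypoints.append((x0 + ratio * (x1 - x0), y0 + ratio * (y1 - y0)))
    return waypoints


intermediate = [(0.0, 0.0, 0.0), (2.0, 4.0, 0.0)]   # 4 m straight ahead over 2 s
waypoint_sequence = to_waypoints(intermediate, step=0.5)
```

The result is an ordered waypoint sequence, one position per time step, which is the form assumed by the prompt-building step that follows.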

[0060] After the electronic device 110 completes the decomposition of the intermediate trajectory and obtains a set of waypoints, the electronic device 110 sends a second guidance message to the visual language model 250 to instruct the visual language model 250 to generate a reference trajectory 255 based on the set of waypoints.

[0061] Additionally, before sending the second guidance information to the visual language model 250, the electronic device 110 can also process the set of waypoints and the second guidance information to facilitate recognition by the visual language model 250. For example, the electronic device 110 can convert the set of waypoints into waypoint tokens that can be recognized by the visual language model 250, and can convert the second guidance information into text tokens that can be recognized by the visual language model 250. Specifically, the electronic device 110 can obtain the waypoint tokens by encoding the set of waypoints.

[0062] As an example, the second guidance information may be a prompt template based on the set of waypoints. For instance, the second guidance information may be: "Refer to the selected trajectory: <point1>, <point2>, ..., <pointn>. Please provide a reference trajectory 255 for autonomous vehicle 120." Here, n represents the number of waypoints in the set of waypoints. The visual language model 250 can generate the reference trajectory 255 based on the visual features 235 and the set of waypoints.

[0063] Additionally, the visual language model 250 can also generate a reference trajectory 255 based on a set of waypoints, using visual features 235 and the motion features of the autonomous vehicle 120. The motion features of the autonomous vehicle 120 may include, but are not limited to, the speed and acceleration of the autonomous vehicle 120.

[0064] In this manner, the visual language model 250 generates a reference trajectory 255 based on visual features 235 using a chain of thought. This process combines environmental perception, safety assessment, and the motion state of the autonomous vehicle 120 to refine the reference trajectory 255 from coarse to fine. This improves the quality of the generated trajectory in terms of accuracy and safety, thereby enabling the autonomous vehicle 120 to drive more safely.

[0065] In box 330, electronic device 110 determines the target trajectory 265 of autonomous vehicle 120 based on visual features 235 and reference trajectory 255 using trajectory planning model 260. Trajectory planning model 260 is an end-to-end model.

[0066] In some embodiments, electronic device 110 may determine the target trajectory 265 of autonomous vehicle 120 based on the following process:

[0067] First, the electronic device 110 determines the portion of the trajectory associated with the target time based on the reference trajectory 255, as the initial trajectory.

[0068] As mentioned earlier, the visual language model 250 has a large parameter scale and a complex computation mechanism, resulting in a low frequency of generating the reference trajectory 255. In other words, the reference trajectory 255 generated by the visual language model 250 may not be real-time. In some embodiments, the electronic device 110 can process the reference trajectory 255 to generate the target trajectory 265.

[0069] As an example, electronic device 110 can construct a trajectory storage unit for storing trajectories, to store the reference trajectory 255 generated by visual language model 250. The trajectory storage unit can be stored, for example, in a device with storage capabilities such as a memory.

[0070] In some embodiments, before obtaining the initial trajectory from the reference trajectory 255, the electronic device 110 also needs to determine the real-time performance of the reference trajectory 255. As an example, the electronic device 110 can retrieve the latest reference trajectory 255 from the trajectory storage unit, and then determine the initial trajectory from the reference trajectory 255.
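A trajectory storage unit of this kind can be sketched as a small bounded buffer that always returns the most recently stamped reference trajectory. The class and field names (`StampedTrajectory`, `TrajectoryStore`, `capacity`) are hypothetical, and the bounded capacity is an assumption added to keep memory use fixed.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Tuple


@dataclass(frozen=True)
class StampedTrajectory:
    timestamp: float                      # time of the images used to generate it
    points: List[Tuple[float, float]]


class TrajectoryStore:
    """Bounded store that hands out the newest reference trajectory."""

    def __init__(self, capacity: int = 8) -> None:
        self._items: Deque[StampedTrajectory] = deque(maxlen=capacity)

    def put(self, trajectory: StampedTrajectory) -> None:
        self._items.append(trajectory)    # the oldest entry is dropped when full

    def latest(self) -> Optional[StampedTrajectory]:
        if not self._items:
            return None
        return max(self._items, key=lambda item: item.timestamp)


store = TrajectoryStore(capacity=2)
store.put(StampedTrajectory(10.0, [(0.0, 0.0)]))
store.put(StampedTrajectory(12.0, [(1.0, 0.0)]))
store.put(StampedTrajectory(11.0, [(2.0, 0.0)]))   # evicts the entry stamped 10.0
newest = store.latest()
```

Selecting by timestamp instead of insertion order keeps the result correct even if a slow model response arrives out of order, as in the last `put` above.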

[0071] It should be understood that the reference trajectory 255 can indicate the trajectory within a predetermined time period, for example, the reference trajectory 255 can indicate the trajectory within 5 seconds. For the current moment, the trajectory in the reference trajectory 255 corresponding to the period before the current moment is invalid, while the trajectory in the reference trajectory 255 corresponding to the period after the current moment is valid and usable. Therefore, the electronic device 110 can divide the reference trajectory 255 based on time, determine the portion of the trajectory associated with the target moment, and thus determine the initial trajectory. Here, the current moment is the target moment.

[0072] Specifically, when generating the reference trajectory 255, the visual language model 250 can add a timestamp to the reference trajectory 255. The timestamp can be, for example, the moment corresponding to a set of perceptual images 210 used to generate the reference trajectory 255. Then, the electronic device 110 can determine the reference point corresponding to the current moment in the reference trajectory 255 based on the timestamp, and then extract the trajectory between the reference point and the endpoint of the reference trajectory 255 to obtain the initial trajectory.
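The extraction of the initial trajectory can be sketched as a slice that starts at the first point lying at or after the current moment. The sketch assumes that the points of the reference trajectory are evenly spaced in time, with the first point located at the timestamp; the spacing `dt` and the name `initial_trajectory` are assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def initial_trajectory(points: List[Point], timestamp: float, dt: float, now: float) -> List[Point]:
    """Keep the part of the reference trajectory from the current moment onward."""
    elapsed = now - timestamp
    if elapsed < 0:
        raise ValueError("current moment precedes the trajectory timestamp")
    # Index of the first point at or after `now`; the epsilon absorbs float noise.
    start = math.ceil(elapsed / dt - 1e-9)
    return points[start:]


# A 5-second reference trajectory sampled every 0.5 s (11 points), stamped at t = 100.0.
reference = [(0.5 * i, 0.0) for i in range(11)]
remaining = initial_trajectory(reference, timestamp=100.0, dt=0.5, now=101.2)
```

With 1.2 s elapsed, the first three points are in the past, so the slice starts at the point for 1.5 s. If the current moment lies beyond the end of the reference trajectory, the result is an empty list, which a caller would need to handle.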

[0073] Then, the electronic device 110 uses the trajectory planning model 260 to generate the target trajectory 265 of the autonomous vehicle 120 at the target time based on the visual features 235 and the initial trajectory.

[0074] In some embodiments, the trajectory planning model 260 can fuse visual features 235 with the initial trajectory to obtain the target trajectory 265 at the target time.

[0075] Additionally, a cross-attention mechanism can be added to the trajectory planning model 260 to enable deep interaction between the visual features 235 and the initial trajectory, thereby improving the safety of the target trajectory 265.

[0076] Additionally, the trajectory planning model 260 can also generate the target trajectory 265 of the autonomous vehicle 120 at the target time based on the visual features 235 and the motion features and initial trajectory of the autonomous vehicle 120.

[0077] Specifically, during the process of generating the target trajectory 265, the trajectory planning model 260 can be optimized by fusing the visual features 235 with the initial trajectory. That is, the ability of the trajectory planning model 260 to understand complex scenarios is improved, as is its ability to predict the behavioral intentions of traffic participants. The target trajectory 265 generated by the trajectory planning model 260 therefore has improved accuracy, flexibility, and safety.

[0078] Example apparatus and device

[0079] Figure 4 shows a schematic structural block diagram of a trajectory generation apparatus 400 according to some embodiments of the present disclosure. Apparatus 400 may be implemented as or included in electronic device 110. Various modules / components in apparatus 400 may be implemented by hardware, software, firmware, or any combination thereof.

[0080] As shown in Figure 4, the apparatus 400 includes a first determining module 410 configured to determine visual features of a set of perceived images captured by an autonomous vehicle; an acquisition module 420 configured to acquire a reference trajectory generated by a visual language model based on the visual features; and a second determining module 430 configured to determine the target trajectory of the autonomous vehicle by a trajectory planning model based on the visual features and the reference trajectory, wherein the trajectory planning model is an end-to-end model.

[0081] In some embodiments, determining visual features of a set of perceived images captured by an autonomous vehicle includes: encoding a set of perceived images to determine encoded features of the set of perceived images; and updating a set of collector query vectors based on the encoded features to determine visual features.

[0082] In some embodiments, updating a set of collector query vectors based on the encoded features includes: providing the encoded features and a set of initial collector query vectors to a first converter network to determine a set of intermediate collector query vectors; and providing the set of intermediate collector query vectors, a set of task query vectors, and the encoded features to a second converter network to determine the visual features.
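The two-stage update of the collector query vectors can be sketched as follows. This is only an illustrative reading: it assumes that each "converter network" behaves like a Transformer-style attention block, that the second stage first lets the collector and task queries interact through self-attention and then attend to the encoded features, and that the updated collector rows serve as the visual features. The helper names `attend` and `update_collector_queries`, the dimensions, and the absence of learned weights are all simplifications not taken from the disclosure.

```python
import numpy as np


def attend(queries: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention with a residual connection (no learned weights)."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return queries + weights @ context


def update_collector_queries(encoded, collector_init, task_queries):
    # Stage 1 (assumed first network): initial collector queries attend to encoded features.
    intermediate = attend(collector_init, encoded)
    # Stage 2 (assumed second network): collector and task queries interact with each
    # other, then the joint set attends to the encoded features again.
    joint = np.concatenate([intermediate, task_queries], axis=0)
    joint = attend(joint, joint)
    joint = attend(joint, encoded)
    # The updated collector rows are taken as the visual features (assumption).
    return joint[: len(intermediate)]


rng = np.random.default_rng(0)
encoded = rng.normal(size=(50, 32))          # encoded features of the perceived images
collector_init = rng.normal(size=(8, 32))    # initial collector query vectors
detection_q = rng.normal(size=(4, 32))       # object detection query vectors
map_q = rng.normal(size=(4, 32))             # map understanding query vectors
task_queries = np.concatenate([detection_q, map_q], axis=0)
visual_features = update_collector_queries(encoded, collector_init, task_queries)
```

Under these assumptions, the task queries influence the resulting visual features only through the joint self-attention step in the second stage.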

[0083] In some embodiments, a set of task query vectors includes: a set of object detection query vectors; and a set of map understanding query vectors.

[0084] In some embodiments, the reference trajectory is generated based on the following process: sending first guidance information to a visual language model to instruct the visual language model to determine an intermediate trajectory from a set of candidate trajectories; and sending second guidance information to the visual language model to instruct the visual language model to generate a reference trajectory based on the intermediate trajectory.

[0085] In some embodiments, a set of candidate trajectories includes: a first candidate trajectory generated by clustering training trajectory data; and / or a second candidate trajectory generated based on the current and historical states of the autonomous vehicle.
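The two kinds of candidate trajectories can be illustrated with a small NumPy sketch. It assumes k-means as the clustering technique and a constant-velocity extrapolation as the state-based generator; neither choice is stated in the disclosure, and the function names `cluster_candidates` and `constant_velocity_candidate` are hypothetical.

```python
import numpy as np


def cluster_candidates(trajs: np.ndarray, k: int, iters: int = 10) -> np.ndarray:
    """Plain k-means over training trajectories of shape (M, T, 2); returns (k, T, 2)."""
    flat = trajs.reshape(len(trajs), -1).astype(float)
    # Deterministic initialization: k evenly spaced training trajectories.
    centers = flat[np.linspace(0, len(flat) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dist = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (M, k)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = flat[labels == j]
            if len(members):  # keep the previous center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers.reshape(k, *trajs.shape[1:])


def constant_velocity_candidate(history, dt: float, steps: int) -> np.ndarray:
    """Extrapolate from past positions (H, 2); the last row is the current position."""
    history = np.asarray(history, dtype=float)
    velocity = (history[-1] - history[-2]) / dt
    t = dt * np.arange(1, steps + 1)[:, None]  # (steps, 1) future time offsets
    return history[-1] + velocity * t          # (steps, 2) future positions


# Toy training data: three roughly straight and three turning trajectories.
xs = np.arange(5.0)
straight = [np.stack([xs, np.full(5, off)], axis=1) for off in (0.0, 0.1, -0.1)]
turning = [np.stack([xs, 0.5 * xs**2 + off], axis=1) for off in (0.0, 0.1, -0.1)]
training = np.array(straight + turning)        # (6, 5, 2)
centers = cluster_candidates(training, k=2)    # first candidate trajectories
cv = constant_velocity_candidate([[0.0, 0.0], [1.0, 0.0]], dt=0.5, steps=3)  # second candidate
```

The cluster centers act as a small vocabulary of typical maneuvers, while the extrapolated trajectory reflects the current motion of the vehicle.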

[0086] In some embodiments, the second guidance information includes a set of waypoints corresponding to the intermediate trajectory.
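The two-step guidance process can be sketched as two successive calls to a visual language model. In the sketch, `vlm` is a hypothetical callable that maps a text prompt to a text reply; the prompt wording, the reply format, and the stub model are assumptions for illustration, and the visual features that the real model would also receive are omitted.

```python
import re
from typing import Callable, List, Sequence, Tuple

Waypoint = Tuple[float, float]
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def format_waypoints(traj: Sequence[Waypoint]) -> str:
    return "; ".join(f"({x:.2f}, {y:.2f})" for x, y in traj)


def parse_waypoints(text: str) -> List[Waypoint]:
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]


def generate_reference_trajectory(
    vlm: Callable[[str], str], candidates: Sequence[Sequence[Waypoint]]
) -> List[Waypoint]:
    # First guidance information: ask the model to pick one candidate trajectory.
    listing = "\n".join(f"{i}: {format_waypoints(t)}" for i, t in enumerate(candidates))
    first = "Select the index of the most suitable candidate trajectory.\n" + listing
    intermediate = candidates[int(vlm(first).strip())]
    # Second guidance information: include the waypoints of the intermediate trajectory.
    second = "Refine these waypoints into a reference trajectory: " + format_waypoints(intermediate)
    return parse_waypoints(vlm(second))


def stub_vlm(prompt: str) -> str:
    """Stand-in for a real model: picks candidate 1, then echoes the waypoints."""
    if prompt.startswith("Select"):
        return "1"
    return prompt.split(": ", 1)[1]


candidates = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5)],
]
reference = generate_reference_trajectory(stub_vlm, candidates)
```

Splitting the task this way keeps the first reply to a single discrete choice, so that only the second reply has to contain numeric waypoints.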

[0087] In some embodiments, determining the target trajectory of an autonomous vehicle based on visual features and a reference trajectory by a trajectory planning model includes: determining a portion of the trajectory associated with a target time based on the reference trajectory as an initial trajectory; and generating the target trajectory of the autonomous vehicle at the target time using the trajectory planning model based on visual features and the initial trajectory.

[0088] Figure 5 shows a block diagram of a computing device 500 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the computing device 500 shown in Figure 5 is merely exemplary and should not be construed as limiting the functionality and scope of the embodiments described herein. The computing device 500 shown in Figure 5 can be used to implement the electronic device 110 of Figure 1.

[0089] As shown in Figure 5, computing device 500 is in the form of a general-purpose computing device. Components of computing device 500 may include, but are not limited to, one or more processors or processing units 510, memory 520, storage devices 530, one or more communication units 540, one or more input devices 550, and one or more output devices 560. Processing unit 510 may be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 520. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of computing device 500.

[0090] Computing device 500 typically includes multiple computer storage media. Such media can be any available media accessible to computing device 500, including but not limited to volatile and non-volatile media, and removable and non-removable media. Memory 520 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 530 can be removable or non-removable media and can include machine-readable media, such as flash drives, disks, or any other media that can be used to store information and / or data (e.g., training data) and can be accessed within computing device 500.

[0091] The computing device 500 may further include additional removable / non-removable, volatile / non-volatile storage media. Although not shown in Figure 5, disk drives for reading from or writing to removable, non-volatile disks (e.g., "floppy disks") and optical disk drives for reading from or writing to removable, non-volatile optical disks can be provided. In these cases, each drive can be connected to a bus (not shown) via one or more data media interfaces. Memory 520 may include computer program product 525 having one or more program modules configured to perform various methods or actions of various embodiments of this disclosure.

[0092] The communication unit 540 enables communication with other computing devices via a communication medium. Additionally, the functionality of the components of the computing device 500 can be implemented as a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the computing device 500 can operate in a networked environment using logical connections to one or more other servers, networked personal computers (PCs), or another network node.

[0093] Input device 550 can be one or more input devices, such as a mouse, keyboard, trackball, etc. Output device 560 can be one or more output devices, such as a monitor, speaker, printer, etc. Computing device 500 can also communicate, as needed and via communication unit 540, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with computing device 500, or with any device (e.g., network card, modem, etc.) that enables computing device 500 to communicate with one or more other computing devices. Such communication can be performed via input / output (I / O) interfaces (not shown).

[0094] According to an exemplary implementation of this disclosure, a computer-readable storage medium is provided that stores computer-executable instructions thereon, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the methods described above.

[0095] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatuses, devices, and computer program products implemented according to this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0096] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0097] Computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0098] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0099] Various implementations of this disclosure have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed implementations. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described implementations. The terminology used herein is chosen to best explain the principles, practical applications, or improvements to technology in the market, or to enable others skilled in the art to understand the various implementations disclosed herein.

Claims

1. A method for generating a trajectory, comprising: determining visual features of a set of perceived images captured by an autonomous vehicle; acquiring a reference trajectory generated by a visual language model based on the visual features; and determining a target trajectory of the autonomous vehicle by a trajectory planning model based on the visual features and the reference trajectory, wherein the trajectory planning model is an end-to-end model.

2. The method of claim 1, wherein determining the visual features of the set of perceived images captured by the autonomous vehicle comprises: encoding the set of perceived images to determine encoded features of the set of perceived images; and updating a set of collector query vectors based on the encoded features to determine the visual features.

3. The method of claim 2, wherein updating the set of collector query vectors based on the encoded features comprises: providing the encoded features and a set of initial collector query vectors to a first converter network to determine a set of intermediate collector query vectors; and providing the set of intermediate collector query vectors, a set of task query vectors, and the encoded features to a second converter network to determine the visual features.

4. The method of claim 3, wherein the set of task query vectors comprises: a set of object detection query vectors; and a set of map understanding query vectors.

5. The method of claim 1, wherein the reference trajectory is generated based on the following process: sending first guidance information to the visual language model to instruct the visual language model to determine an intermediate trajectory from a set of candidate trajectories; and sending second guidance information to the visual language model to instruct the visual language model to generate the reference trajectory based on the intermediate trajectory.

6. The method of claim 5, wherein the set of candidate trajectories comprises: a first candidate trajectory generated by clustering training trajectory data; and / or a second candidate trajectory generated based on current and historical states of the autonomous vehicle.

7. The method of claim 5, wherein the second guidance information includes a set of waypoints corresponding to the intermediate trajectory.

8. The method of claim 1, wherein determining the target trajectory of the autonomous vehicle by the trajectory planning model based on the visual features and the reference trajectory comprises: determining, based on the reference trajectory, a portion of the trajectory associated with a target time as an initial trajectory; and generating the target trajectory of the autonomous vehicle at the target time using the trajectory planning model based on the visual features and the initial trajectory.

9. An apparatus for generating a trajectory, comprising: a first determining module configured to determine visual features of a set of perceived images captured by an autonomous vehicle; an acquisition module configured to acquire a reference trajectory generated by a visual language model based on the visual features; and a second determining module configured to determine a target trajectory of the autonomous vehicle by a trajectory planning model based on the visual features and the reference trajectory, wherein the trajectory planning model is an end-to-end model.

10. A computing device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the method according to any one of claims 1 to 8.

11. A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement the method according to any one of claims 1 to 8.

12. A computer program product comprising computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the method according to any one of claims 1 to 8.