Vehicle exterior environment sensing method, vehicle terminal, server and computer program product

By combining vehicle voice recognition and environmental image processing technologies with location information, intelligent interaction between the vehicle and the user is achieved, solving the limitations of existing system interaction, providing real-time external information and personalized suggestions, and improving the driving experience.

CN119389232B · Active · Publication Date: 2026-06-19 · GUANGZHOU AUTOMOBILE GROUP CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGZHOU AUTOMOBILE GROUP CO LTD
Filing Date
2024-10-28
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing vehicle environmental perception systems have limitations in vehicle-user interaction, failing to dynamically adapt to users' real-time needs and behaviors. They mainly focus on specific functions such as lane keeping or obstacle detection, lacking comprehensiveness and intelligence.

Method used

The vehicle's voice recognition module acquires user voice information, combines it with environmental images captured by the vehicle's camera and location information from the positioning module, determines the name of the target object, and provides feedback to the user via voice broadcast or display screen, thus enabling interaction between the vehicle and the user.

Benefits of technology

It enhances the interaction between the vehicle and the user, provides a more convenient and intelligent driving experience, improves driving safety and convenience, and can provide real-time external environment information and personalized driving suggestions.

✦ Generated by Eureka AI based on patent content.

Figure CN119389232B_ABST

Abstract

This application relates to a method for perceiving the external environment of a vehicle, an in-vehicle terminal, a server, and a computer program product, including: acquiring first user voice information based on a vehicle's voice recognition module, the first user voice information including the location information of a target object and indication information for obtaining the name of the target object; controlling the vehicle's camera to acquire an image of the external environment containing the target object; controlling the vehicle's positioning module to acquire the vehicle's current location information; determining the name information of the target object based on the external environment image, the vehicle's current location information, and the first user voice information; and prompting the user of the vehicle with the name information of the target object. This application enhances the interaction between the vehicle and the user, providing a more convenient and intelligent driving experience.

Description

Technical Field

[0001] This application relates to the field of vehicle technology, specifically to methods for perceiving the external environment of a vehicle, in-vehicle terminals, servers, and computer program products.

Background Technology

[0002] Currently, most applications of vehicle environmental perception technology focus primarily on autonomous driving functions. They typically use sensors to detect the vehicle's surroundings and determine whether to trigger certain mechanisms based on the detection results, then perform automatic vehicle control, such as obstacle detection, lane keeping, and adaptive cruise control. Therefore, current environmental perception technology has certain limitations in terms of vehicle-user interaction, focusing on specific functions like lane keeping or obstacle detection, and is not comprehensive enough to dynamically adapt to the user's real-time needs and behaviors.

[0003] With the continuous advancement of vehicle-to-everything (V2X) technology and artificial intelligence (AI) technology, vehicles have evolved from simple means of transportation into complex mobile intelligent platforms. This transformation places higher demands on the vehicle's environmental perception system. Technological innovation is therefore urgently needed so that the vehicle's environmental perception system can better support vehicle intelligence and provide users with a more convenient driving experience.

Summary of the Invention

[0004] The purpose of this application is to propose a method for perceiving the external environment of a vehicle, an in-vehicle terminal, a server, and a computer program product to enhance the interaction between the vehicle and the user, and to provide the user with a more convenient and intelligent driving experience.

[0005] To achieve the above objectives, according to a first aspect of this application, a method for perceiving the external environment of a vehicle is provided, the method comprising:

[0006] The vehicle's voice recognition module acquires first user voice information, which includes the location information of the target object and indication information for obtaining the name of the target object.

[0007] The vehicle's camera is controlled to acquire images of the external environment containing the target object;

[0008] The vehicle's positioning module is controlled to obtain the vehicle's current location information;

[0009] The name information of the target object is determined based on the external environment image, the current location information of the vehicle, and the first user's voice information.

[0010] The user of the vehicle is prompted with the name information of the target object.

[0011] According to a second aspect of this application, another method for perceiving the external environment of a vehicle is provided, the method comprising:

[0012] The system acquires images of the vehicle's external environment containing the target object, the vehicle's current location information, and the first user's voice information uploaded by the vehicle. The first user's voice information includes the target object's orientation information and indication information for obtaining the name of the target object.

[0013] The name information of the target object is determined based on the external environment image, the current location information of the vehicle, and the first user's voice information.

[0014] The name information of the target object is sent to the vehicle so that the vehicle can prompt the user with the name information of the target object.

[0015] According to a third aspect of this application, a vehicle-mounted terminal is provided, the vehicle-mounted terminal including a module for performing the method described in the first aspect of this application.

[0016] According to a fourth aspect of this application, a server is provided, the server including a module for performing the method described in the second aspect of this application.

[0017] According to a fifth aspect of this application, a computer program product is provided, including computer program instructions that instruct a computer device to perform an operation corresponding to the method described in the first or second aspect of this application.

[0018] The vehicle external environment perception method, vehicle terminal, server, and computer program product proposed in this application have the following beneficial effects:

[0019] During vehicle use, the in-vehicle terminal captures and parses the user's voice command to obtain the first user voice information, and acquires an image of the external environment and the vehicle's current geographical location information. Based on the first user voice information, the external environment image, and the geographical location information, the user's question is understood (i.e., a query for the name information of the target object, such as "What is the tall building in front?"), the name information of the target object is obtained according to the question, and the result is returned to the in-vehicle terminal. The in-vehicle terminal then broadcasts the received perception result by voice (e.g., "The tall building in front of you is a certain building"). In this way, the interaction between the vehicle and the user can be enhanced, providing the user with a more convenient and intelligent driving experience.

Attached Figure Description

[0020] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings required in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0021] Figure 1 is a flowchart of a vehicle external environment perception method according to Embodiment 1 of this application.

[0022] Figure 2 is a schematic diagram illustrating the interaction principle between the vehicle-mounted terminal and the server in Embodiment 1 of this application.

[0023] Figure 3 is a flowchart of a vehicle external environment perception method according to Embodiment 2 of this application.

Detailed Implementation

[0024] The detailed description of the accompanying drawings is intended to illustrate the present embodiments of this application and is not intended to represent only the forms in which this application can be implemented. It should be understood that the same or equivalent functions can be accomplished by different embodiments intended to be included within the spirit and scope of this application.

[0025] Referring to Figure 1, Embodiment 1 of this application provides a method for perceiving the external environment of a vehicle. This method is implemented based on an in-vehicle terminal and includes the following steps:

[0026] Step S11: The vehicle's voice recognition module acquires first user voice information, which includes the location information of the target object and indication information for obtaining the name of the target object.

[0027] Specifically, in one example, the voice recognition module may have a built-in microphone that can collect the voice commands of the user in the vehicle and filter out background noise while the vehicle is in motion to ensure accurate capture of the user's voice information. Finally, voice recognition technology, such as deep learning algorithms, is used to convert the user's voice commands into text information.

[0028] Specifically, vehicle users need to clearly indicate the location of the target object in the voice command, such as "left front" or "right rear", so that the vehicle terminal can locate the target object. Users need to provide instructions to obtain the name of the target object in the voice command, such as "What is the name of that building?" or "What is that sign?", which is the key for the vehicle terminal to identify the target object. For example, the voice command is "Please tell me the name of the building in the left front", or "Identify the meaning of the sign in the right rear", or "I want to know the name of the restaurant in front", and so on.

[0029] The voice recognition module converts the pre-processed sound signal into text information and extracts the location and indication information of the target object. Through this step, the vehicle terminal can understand the user's query intent, laying the foundation for subsequent environmental image acquisition and target object recognition. This voice-based interaction method not only facilitates the driver's access to information during driving, but also reflects the intelligent and human-centered design concept of the vehicle.
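The extraction of orientation and indication information from the recognized text can be illustrated with a minimal sketch. It assumes the speech has already been converted to English text; the phrase lists `ORIENTATIONS` and `NAME_CUES`, the `ParsedCommand` type, and the keyword-matching approach are hypothetical stand-ins for whatever language-understanding model an actual implementation would use.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical orientation phrases. Longer phrases come first so that
# "left front" is matched before the bare words "left" or "front".
ORIENTATIONS = [
    "left front", "right front", "left rear", "right rear",
    "left", "right", "front", "rear",
]

# Hypothetical cues signalling that the user is asking for a name.
NAME_CUES = ("what is", "name of", "tell me", "identify", "want to know")


@dataclass
class ParsedCommand:
    orientation: Optional[str]  # e.g. "left front"; None if not stated
    wants_name: bool            # True if the command asks for a name


def parse_command(text: str) -> ParsedCommand:
    """Extract orientation and naming intent from recognized command text."""
    lowered = text.lower()
    orientation = None
    for phrase in ORIENTATIONS:
        # Whole-word match so that "front" does not match inside "frontier".
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            orientation = phrase
            break
    wants_name = any(cue in lowered for cue in NAME_CUES)
    return ParsedCommand(orientation, wants_name)
```

A command with no orientation phrase yields `orientation=None`, which a terminal could treat as a cue to ask the user for clarification.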

[0030] Step S12: Control the vehicle's camera to acquire images of the external environment containing the target object.

[0031] Specifically, for example, if the user's voice command is "Please tell me the name of the building to my left," then the exterior environment image captured by the camera to the left will be used as the exterior environment image containing the target object. As another example, if the user's voice command is "Identify the meaning of the sign to my right rear," then the exterior environment image captured by the camera to the right rear will be used as the exterior environment image containing the target object.
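The camera choice described here amounts to a lookup from the spoken orientation to a camera. The sketch below assumes one camera per orientation; the identifiers such as `cam_left` and the fallback to a front camera are hypothetical and not taken from the application.

```python
# Hypothetical camera identifiers keyed by the orientation phrase
# extracted from the user's voice command.
CAMERA_BY_ORIENTATION = {
    "front": "cam_front",
    "rear": "cam_rear",
    "left": "cam_left",
    "right": "cam_right",
    "left front": "cam_left_front",
    "right front": "cam_right_front",
    "left rear": "cam_left_rear",
    "right rear": "cam_right_rear",
}


def select_camera(orientation: str, default: str = "cam_front") -> str:
    """Return the camera whose view covers the stated orientation.

    Falls back to `default` (an assumption) for unknown orientations.
    """
    return CAMERA_BY_ORIENTATION.get(orientation.strip().lower(), default)
```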

[0032] Step S13: Control the vehicle's positioning module to obtain the vehicle's current location information.

[0033] Specifically, the vehicle's positioning module can be GPS (Global Positioning System), GLONASS, BeiDou Navigation Satellite System, or other satellite navigation systems.

[0034] Step S14: Determine the name information of the target object based on the external environment image, the current location information of the vehicle, and the first user voice information.

[0035] Specifically, based on the first user voice information, the location information of the target object and the instruction information for obtaining the name of the target object can be obtained. By recognizing the external environment image of the vehicle, the characteristics of each object in the image can be determined. Based on the location information and the instruction information, the target object can be identified in the image. Then, based on the target object and the vehicle's current location information, the name information of the target object can be obtained by querying the internal database or calling the interface of a third-party ecosystem platform (such as map services, business information query services, etc.).

[0036] It should be noted that the process of determining the name information of the target object in step S14 of this embodiment can be implemented by the vehicle terminal alone, or through interaction between the vehicle terminal and the server. In the latter case, the vehicle terminal sends the external environment image, the current location information of the vehicle, and the first user voice information to the server and requests the server to determine the name information of the target object from them; the vehicle terminal then receives the name information of the target object returned by the server. The method of this embodiment is not limited to either approach. The specific design depends on the computing resources of the vehicle terminal: if they are insufficient, the process can be implemented on the server; if they are sufficient, it can be implemented by the vehicle terminal alone.

[0037] Step S15: Prompt the user of the vehicle with the name information of the target object.

[0038] Specifically, text-to-speech (TTS) technology can be used to convert the name information of the target object into natural and fluent speech for broadcast. Furthermore, in addition to voice broadcasting, the in-vehicle terminal can also provide visual feedback through the vehicle's central control display screen, displaying the name and / or image of the target object, highlighting the target object on the display screen, or labeling its name next to it.

[0039] In summary, the method of this embodiment, during vehicle use, involves the in-vehicle terminal capturing and parsing the user's voice commands to obtain first user voice information, acquiring images of the external environment and the vehicle's current geographical location information, understanding the user's question (i.e., querying the name information of a target object, such as what the tall building to your left is), obtaining the name information of the target object based on the question, and then returning it to the in-vehicle terminal. The in-vehicle terminal then broadcasts the received perception results via voice (e.g., the tall building to your left is a certain building). Through this method, the interaction between the vehicle and the user can be enhanced, providing the user with a more convenient and intelligent driving experience.

[0040] In some embodiments, the first user voice information further includes: description information of the target object, the description information including at least one of the following: color information of the target object, type information of the target object, name information of surrounding objects of the target object, and type information of surrounding objects of the target object.

[0041] Specifically, the descriptive information further provides more target features for identifying the target object in the external environment image; the vehicle terminal or server filters objects of the corresponding color in the image based on the color information provided by the user (such as "red building"); the type information (such as "commercial building") helps the vehicle terminal or server to further narrow down the recognition range and identify only specific types of buildings; if the user mentions the name information of objects around the target object (such as "building next to the bank"), the vehicle terminal can use the name information of the objects around the target object as an auxiliary clue to locate the target object; the type information of the objects around the target object (such as "building near the park") can also help the vehicle terminal locate the target object in the image.
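The narrowing effect of the description information can be sketched as successive filters over the objects detected in the image. The `DetectedObject` and `Description` types and their fields are hypothetical; an actual system would derive these attributes from image recognition rather than receive them ready-made.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DetectedObject:
    label: str                    # hypothetical identifier within the image
    color: str                    # e.g. "red"
    kind: str                     # e.g. "commercial building"
    neighbors: List[str] = field(default_factory=list)  # nearby names or types


@dataclass
class Description:
    color: Optional[str] = None     # "red building" -> "red"
    kind: Optional[str] = None      # "commercial building"
    neighbor: Optional[str] = None  # "next to the bank" -> "bank"


def filter_candidates(objects: List[DetectedObject],
                      desc: Description) -> List[DetectedObject]:
    """Keep only objects consistent with every clue the user supplied."""
    result = []
    for obj in objects:
        if desc.color and obj.color != desc.color:
            continue
        if desc.kind and obj.kind != desc.kind:
            continue
        if desc.neighbor and desc.neighbor not in obj.neighbors:
            continue
        result.append(obj)
    return result
```

Clues the user did not mention are left as `None` and impose no constraint, so an empty description returns every detected object.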

[0042] In some embodiments, determining the name information of the target object based on the external environment image, the vehicle's current location information, and the first user voice information includes:

[0043] The first user's voice information, the image of the external environment, and the vehicle's current location information are sent to the server to request the server to determine the name information of the target object;

[0044] Receive the name information of the target object returned by the server.

[0045] Specifically, Figure 2 illustrates the interaction principle between the vehicle-mounted terminal and the server. The vehicle-mounted terminal captures the user's first voice command via a microphone or other voice input device. The first voice command is a question the user asks about an object in the external environment, such as a building or service facility, for example, "What is that building to the left?". When the first voice command is captured, the vehicle-mounted terminal parses it to obtain the first user voice information and, at the same time, triggers the acquisition of the external environment image and the vehicle's current location information. Intelligent vehicles are generally equipped with multiple cameras. As shown in Figure 2, during vehicle use the cameras capture video streams of the environment from different directions. After preprocessing and compression by an analog-to-digital converter (ADC), the video streams are sent to the vehicle terminal via the RTP protocol. The vehicle terminal selects the external environment image from the video stream of the camera that corresponds to the first user voice information and anonymizes the image. For example, if the user asks, "What is that building to the left?", the vehicle terminal extracts a frame from the video stream of the left front camera as the external environment image and anonymizes it. The timestamp of this frame is consistent with the timestamp at which the vehicle terminal captured the first voice command. The vehicle's current location information is the vehicle's location (latitude and longitude) at that same timestamp. The vehicle terminal then compresses and encrypts the first user voice information, the external environment image, and the vehicle's current location information to generate a first query request, and sends the first query request to the server.
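Frame selection by timestamp and the packaging of the first query request can be sketched as follows. Several points are assumptions made for illustration: "consistent with the timestamp" is read as "nearest in time", the request is serialized as JSON and compressed with zlib, and both image anonymization and encryption are left out because the application does not specify them. The function names `frame_at` and `build_query_request` are hypothetical.

```python
import base64
import bisect
import json
import zlib
from typing import List, Tuple

Frame = Tuple[float, bytes]  # (timestamp in seconds, encoded image bytes)


def frame_at(frames: List[Frame], t_command: float) -> Frame:
    """Return the frame whose timestamp is closest to the command timestamp.

    `frames` must be non-empty and sorted by timestamp.
    """
    times = [t for t, _ in frames]
    i = bisect.bisect_left(times, t_command)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = frames[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda f: abs(f[0] - t_command))


def build_query_request(voice_text: str, image: bytes,
                        lat: float, lon: float, timestamp: float) -> bytes:
    """Serialize and compress the three inputs into one request body.

    Encryption is intentionally omitted in this sketch; a real terminal
    would encrypt the compressed bytes before sending them.
    """
    payload = {
        "voice": voice_text,
        "image": base64.b64encode(image).decode("ascii"),
        "location": {"lat": lat, "lon": lon},
        "timestamp": timestamp,
    }
    return zlib.compress(json.dumps(payload).encode("utf-8"))
```

On the server side, the matching parse step in this sketch would be `json.loads(zlib.decompress(body))`.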

[0046] Specifically, the in-vehicle terminal receives the name information of the target object from the server via the network. Upon receiving the perception results, the terminal processes this information to convey it to the user in a suitable voice format. The in-vehicle terminal uses text-to-speech (TTS) technology to convert the processed perception results into voice information, which is then broadcast to the user through the in-vehicle audio system. The voice announcement might be: "The building to your left is the XX Building, a landmark building in City A. It is an Art Deco skyscraper built in 1980, primarily used for offices and sightseeing." Users can obtain specific information about their external environment through hearing without needing to look at a screen or perform other operations, thus improving driving safety and convenience.

[0047] The server determines the name information of the target object in the following way:

[0048] The first user's voice information, the external environment image, and the vehicle's current location information are respectively used to extract features to obtain a voice feature vector, an image feature vector, and a geographic location feature vector;

[0049] The speech feature vector, image feature vector, and geographic location feature vector are fused to obtain a fused feature vector, and a decision is made based on the fused feature vector to obtain the type and probability of each object in the vehicle exterior environment image;

[0050] A first query task is generated based on the type and probability of each object;

[0051] The corresponding preset expert model is determined based on the first query task, and the preset expert model is called to execute the first query task, querying the preset knowledge database or calling the interface of a third-party ecosystem platform to obtain the name information of the target object.

[0052] Specifically, as shown in Figure 2, in response to receiving the first query request, the server performs the following operations:

[0053] (1.1) Parse (decompress and decrypt) the first query request to obtain the first user's voice information, the image of the external environment of the vehicle, and the current location information of the vehicle;

[0054] (1.2) Feature extraction is performed on the first user's voice information, the external environment image, and the vehicle's current location information to obtain a voice feature vector, an image feature vector, and a geographic location feature vector (i.e., the text and image labels in Figure 2);

[0055] Specifically, in this embodiment, for ease of description, the image feature vector is defined as V_{image}, which can be extracted by a convolutional neural network (CNN); the speech feature vector is defined as V_{voice}, which can be extracted by a natural language processing (NLP) model (e.g., the Transformer model); and the geographic location feature vector is defined as V_{location}, which can be a feature vector encoded from geographic coordinates.
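The application does not say how geographic coordinates are encoded into V_{location}. One common choice, shown here purely as an assumption, is to map latitude and longitude to sine and cosine components so that the features are bounded and longitude wraps smoothly at ±180°. The function name `encode_location` is hypothetical.

```python
import math
from typing import List


def encode_location(lat_deg: float, lon_deg: float) -> List[float]:
    """Encode latitude/longitude (degrees) as a 4-dimensional feature vector.

    Each component lies in [-1, 1]; this is an illustrative encoding,
    not the one used in the application.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return [math.sin(lat), math.cos(lat), math.sin(lon), math.cos(lon)]
```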

[0056] (1.3) The speech feature vector, image feature vector and geographic location feature vector are fused to obtain a fused feature vector, and a decision is made based on the fused feature vector to obtain the type and probability of each object in the vehicle exterior environment image;

[0057] Specifically, because feature vectors from different modalities differ significantly in scale, V_{image}, V_{voice}, and V_{location} are first standardized before fusion so that they are on the same order of magnitude. This can be achieved using techniques such as batch normalization. The standardized V_{image}, V_{voice}, and V_{location} are then concatenated to form a multimodal feature vector V_{multi}, which can be represented as:

[0058] V_{multi}=[V_{image},V_{voice},V_{location}];
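The standardize-then-concatenate step can be sketched in a few lines. Note the substitution: batch normalization computes statistics across a batch of samples, whereas this sketch standardizes each vector on its own (zero mean, unit variance) as a simple stand-in. The helper names `standardize` and `concat_modalities` are hypothetical.

```python
import math
from typing import List


def standardize(v: List[float], eps: float = 1e-8) -> List[float]:
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    std = math.sqrt(var + eps)  # eps guards against division by zero
    return [(x - mean) / std for x in v]


def concat_modalities(v_image: List[float], v_voice: List[float],
                      v_location: List[float]) -> List[float]:
    """Form V_multi = [V_image, V_voice, V_location] from standardized parts."""
    return standardize(v_image) + standardize(v_voice) + standardize(v_location)
```

After this step the three parts share a common scale even if, for example, the raw image features are in the hundreds and the location features lie in [-1, 1].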

[0059] This embodiment designs a multimodal fusion model to fuse V_{image}, V_{voice}, and V_{location}. During concatenation, the multimodal fusion model introduces an attention mechanism to assign different weights to different modalities, which helps the model focus on the more important information sources. The attention weights can be represented as:

[0060] a_{i}=softmax(W_{a}*V_{i});

[0061] where a_{i} is the attention weight of the i-th modality, W_{a} is the weight matrix, and V_{i} is the feature vector of the i-th modality.

[0062] Furthermore, by using attention weights to perform a weighted summation of the feature vectors of each modality, it can be expressed as:

[0063] V_{weighted}=\sum_{i}a_{i}*V_{i};

[0064] where V_{weighted} is the weighted summation result.

[0065] Finally, an activation function is used to perform a nonlinear transformation on the fused feature vector to capture complex feature relationships, which can be expressed as:

[0066] V_{transformed}=f(w_{l}*V_{weighted}+b_{l})

[0067] where V_{transformed} is the fused feature vector, w_{l} and b_{l} are the weight matrix and the bias vector, respectively, and f() represents the activation function.
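The attention-weighted fusion can be sketched in plain Python. Several choices here are assumptions rather than details from the application: all modality vectors are taken to have the same dimension, W_{a} is reduced to a single scoring vector `w_a` so that each modality gets one scalar score, the softmax runs across modalities, the weighted summation is read as the sum of a_{i}*V_{i} over the modalities, and tanh is used as the activation function f.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(modalities: Sequence[Sequence[float]],
         w_a: Sequence[float],
         w_l: Sequence[Sequence[float]],
         b_l: Sequence[float]) -> List[float]:
    """Attention-weighted fusion followed by a nonlinear transform."""
    # One scalar score per modality: s_i = w_a . V_i
    scores = [sum(w * x for w, x in zip(w_a, v)) for v in modalities]
    # Attention weights a_i, normalized across modalities
    a = softmax(scores)
    # V_weighted = sum_i a_i * V_i (element-wise over the shared dimension)
    dim = len(modalities[0])
    v_weighted = [
        sum(a[i] * modalities[i][k] for i in range(len(modalities)))
        for k in range(dim)
    ]
    # V_transformed = f(w_l * V_weighted + b_l), with f = tanh
    return [
        math.tanh(sum(w * x for w, x in zip(row, v_weighted)) + b)
        for row, b in zip(w_l, b_l)
    ]
```

In a trained model `w_a`, `w_l`, and `b_l` would be learned parameters; here they are passed in explicitly so the arithmetic is easy to follow.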

[0068] The final part of the multimodal fusion model is a decision layer, which uses reinforcement learning algorithms to optimize the decision-making process and, based on V_{transformed}, outputs the type and probability of each object in the vehicle exterior environment image. An example is given below to illustrate this:

[0069] Input: V_{transformed} (fused feature vector);

[0070] Decision layer processing: Reinforcement learning algorithms process V_{transformed}, predicting object type through a series of neural network layers and policies;

[0071] Output: The decision layer outputs a series of object types and their corresponding probabilities, such as "building" with a probability of 0.85, "traffic sign" with a probability of 0.10, and "animal" with a probability of 0.05.

[0072] (1.4) Generate a first query task based on the type and probability of each object;

[0073] Specifically, based on the type and probability of each object, the relevant feature information of the object with the highest probability is selected as the retrieval information for the expert model to perform knowledge query. In the example above, "building" with a probability of 0.85 is selected as the target object to be queried. At this time, content mapping is performed to generate the first query task. The first query task includes the relevant feature information of "building" with a probability of 0.85, as well as the first user's voice information, the image of the external environment of the vehicle, and the current location information of the vehicle.
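Generating the first query task from the decision-layer output can be sketched as picking the most probable type and bundling it with the original inputs. The dictionary layout and the function name `build_query_task` are hypothetical; the application does not define a concrete task format.

```python
from typing import Any, Dict, Tuple


def build_query_task(type_probs: Dict[str, float],
                     voice_text: str,
                     image_ref: str,
                     location: Tuple[float, float]) -> Dict[str, Any]:
    """Select the most probable object type and assemble the query task."""
    if not type_probs:
        raise ValueError("decision layer returned no object types")
    target_type, probability = max(type_probs.items(), key=lambda kv: kv[1])
    return {
        "target_type": target_type,   # e.g. "building"
        "probability": probability,   # e.g. 0.85
        "voice": voice_text,
        "image": image_ref,           # hypothetical reference to the image
        "location": location,         # (latitude, longitude)
    }
```

With the probabilities from the example above (`building` 0.85, `traffic sign` 0.10, `animal` 0.05), the task's `target_type` is `building`.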

[0074] (1.5) Determine the corresponding preset expert model based on the first query task, and call the preset expert model to execute the first query task, query the preset knowledge database or call the third-party ecosystem platform interface to obtain the name information of the target object.

[0075] For example, when the first query task is about identifying a "building," it is assigned to a corresponding expert model. This expert model is specifically designed to identify and provide building-related information. The selected expert model receives the first query task and begins performing the following operations: The expert model parses the information in the first query task, including the building's visual features, the user's voice description, images of the vehicle's external environment, and the vehicle's location information. The expert model first queries a pre-set knowledge database containing relevant building information, such as the building's name, history, and purpose. If a matching result is found in the knowledge database, that result is output as the target object's name information. If no matching result is found in the knowledge database, or if more detailed or specific information is needed, a third-party ecosystem platform interface is invoked. This third-party ecosystem platform could be a map service, etc., to obtain more comprehensive data. Specifically, the target object's name information can include the building's name, type, and other relevant attributes, such as "The building to your left is [building name], it is a [architectural style] building, built in [year of construction], and mainly used for [purpose]."
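The dispatch to an expert model, with a knowledge-database lookup that falls back to a third-party interface, can be sketched as below. The `ExpertModel` class, the `Lookup` callable signature, and the task keys are hypothetical; real lookups would take richer inputs such as visual features and the voice description.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Location = Tuple[float, float]
# A lookup takes (object type, location) and returns a name, or None on a miss.
Lookup = Callable[[str, Location], Optional[str]]


class ExpertModel:
    """Expert for one object type: knowledge base first, third party second."""

    def __init__(self, knowledge_lookup: Lookup, third_party_lookup: Lookup):
        self.knowledge_lookup = knowledge_lookup
        self.third_party_lookup = third_party_lookup

    def run(self, task: Dict[str, Any]) -> Optional[str]:
        name = self.knowledge_lookup(task["target_type"], task["location"])
        if name is None:
            # No match in the preset knowledge database: call the
            # third-party interface (e.g. a map service).
            name = self.third_party_lookup(task["target_type"], task["location"])
        return name


def dispatch(task: Dict[str, Any],
             experts: Dict[str, ExpertModel]) -> Optional[str]:
    """Route the task to the expert registered for its object type."""
    expert = experts.get(task["target_type"])
    return expert.run(task) if expert is not None else None
```

Keeping the two lookups as injected callables means the fallback order can be tested without a real database or network service.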

[0076] In this embodiment, during vehicle use, the in-vehicle terminal captures and parses the user's voice commands to obtain the first user's voice information. Simultaneously, it triggers the acquisition of external environment images and the vehicle's current geographical location information. Through vehicle networking technology, the first user's voice information, external environment images, and geographical location information are sent to the server. Based on the first user's voice information, external environment images, and geographical location information, the server understands the user's question and obtains the name information of the target object based on the question. This information is then returned to the in-vehicle terminal, which performs voice broadcast based on the received perception results. Through this method, the interaction between the vehicle and the user can be enhanced, providing the user with a more convenient and intelligent driving experience.

[0077] In some embodiments, the method further includes:

[0078] Step S16: Receive the first location service information or first vehicle use suggestion related to the target object returned by the server, and prompt the user of the vehicle with the first location service information or first vehicle use suggestion;

[0079] The first location service information or the first vehicle use suggestion is obtained by the server based on the name information of the target object.

[0080] Specifically, in this embodiment, the vehicle terminal not only receives the name information of the target object, but also receives location service information or vehicle usage suggestions related to these objects. While the server analyzes the target object perception results, it also provides additional information based on these results, namely, first location service information or first vehicle usage suggestions. The first location service information includes, for example, information about nearby restaurants, parking lots, gas stations, hotels, and other service facilities, as well as their opening hours, user ratings, and contact information. The first vehicle usage suggestions are, for example, suggestions based on factors such as the vehicle's current location, destination, traffic conditions, and vehicle status (e.g., fuel level, battery level), such as suggested driving routes, locations of gas or charging stations, and vehicle maintenance reminders.

[0081] Furthermore, after receiving this information, the in-vehicle terminal processes it to select the most relevant and useful information to broadcast to the user. This processing may include filtering, sorting, and formatting the information to ensure the broadcast content is concise and clear. The in-vehicle terminal uses text-to-speech (TTS) technology to convert the processed location service information or driving suggestions into speech, which is then broadcast to the user through the in-vehicle audio system. For example, the broadcast might say: "The gas station 1 kilometer ahead is currently offering a promotion; we recommend you refuel there. Additionally, the nearby 'Food Paradise' restaurant offers specialty dishes; you can dine there after refueling."
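The filter, sort, and format step performed before the broadcast can be sketched as follows. The item fields (`name`, `distance_km`, `rating`), the rating threshold, and the output wording are all hypothetical; the resulting string is what would be handed to the TTS engine.

```python
from typing import Any, Dict, List


def compose_broadcast(items: List[Dict[str, Any]],
                      max_items: int = 2,
                      min_rating: float = 4.0) -> str:
    """Filter by rating, sort by distance, and format a short announcement."""
    # Filtering: drop poorly rated services.
    relevant = [it for it in items if it["rating"] >= min_rating]
    # Sorting: nearest first.
    relevant.sort(key=lambda it: it["distance_km"])
    # Formatting: keep the announcement short.
    parts = [
        f"{it['name']}, {it['distance_km']:g} km away"
        for it in relevant[:max_items]
    ]
    if not parts:
        return "No nearby services to suggest."
    return "Nearby: " + "; ".join(parts) + "."
```

Capping the list at `max_items` keeps the spoken message brief, which matters more for audio than for an on-screen list.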

[0082] Through step S16, the vehicle terminal not only provides direct information about the external environment, but also provides practical service information and personalized vehicle usage suggestions, thereby further enhancing the functionality and user-friendliness of the vehicle system. This integrated service information provision makes the vehicle not just a means of transportation, but an intelligent partner in the user's daily life.

[0083] In some embodiments, the method further includes:

[0084] Step S17: The vehicle's voice recognition module acquires second user voice information, which includes instructions for obtaining location service information or vehicle usage suggestions related to the target object.

[0085] Step S18: Send the second user voice information to the server to request the server to determine location service information or vehicle usage suggestions related to the target object.

[0086] Specifically, the vehicle-mounted terminal captures a second voice command issued by the user through a microphone or other voice input device installed in the vehicle. The second voice command is a follow-up question from the user about the target object (such as a building or service facility). For example, if the first voice command is "What is that building to the left?", the second voice command could be "Is there a coffee shop in this building?". When the vehicle-mounted terminal captures the first voice command, it creates a new session. If the second voice command is captured during this session, the vehicle-mounted terminal parses it to obtain the second user voice information and then determines, based on that information, whether to create a new session. If no new session is needed, the terminal does not trigger the capture of the external environment image or the acquisition of the current vehicle location information. Instead, it generates a second query request based on the second user voice information and sends the request to the server, which processes it using the context information of the current session.

[0087] Step S19: Receive the second location service information or second vehicle use suggestion returned by the server, and prompt the vehicle user with the second location service information or second vehicle use suggestion.

[0088] For example, after receiving the information, the vehicle terminal can tell the user through voice broadcast: "There is a large underground parking lot in the shopping center. There are currently empty spaces. The entrance is on the north side. You can turn left at the next intersection." Through steps S17 to S19, the server can use the user's voice commands and session memory data to provide accurate and timely service information through expert models, providing users with a more coherent and personalized service experience.
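The terminal-side session handling in steps S17 to S19 can be sketched as follows. This application does not state how the terminal decides whether a new session is needed, so the direction-word check below is purely an assumed heuristic, and the `Terminal` class, its field names, and the placeholder image and location values are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Assumed cue: a direction word suggests the user is pointing at a new target.
DIRECTION_WORDS = ("left", "right", "ahead", "behind")


@dataclass
class Terminal:
    session_id: Optional[str] = None
    captures: int = 0  # counts simulated camera/positioning acquisitions

    def needs_new_session(self, text: str) -> bool:
        if self.session_id is None:
            return True
        cleaned = text.lower()
        for ch in "?.,!":
            cleaned = cleaned.replace(ch, " ")
        return any(word in DIRECTION_WORDS for word in cleaned.split())

    def handle_voice(self, text: str) -> dict:
        if self.needs_new_session(text):
            # New session: acquire the image and location along with the voice.
            self.session_id = uuid.uuid4().hex
            self.captures += 1
            return {
                "session_id": self.session_id,
                "voice": text,
                "image": "<external environment image>",
                "location": "<current vehicle location>",
            }
        # Follow-up in the same session: send voice only; the server uses context.
        return {"session_id": self.session_id, "voice": text}
```

With this sketch, "What is that building to the left?" opens a session and triggers one capture, while "Is there a coffee shop in this building?" reuses the session and sends only the voice information.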

[0089] The server determines the location service information or vehicle usage suggestions related to the target object in the following way:

[0090] Obtain the memory data of the current session, which includes all interaction data between the vehicle and the server during the session;

[0091] A second query task is generated based on the second user's voice information and the memory data;

[0092] The corresponding preset expert model is determined based on the second query task, and the preset expert model is called to execute the second query task, querying the preset knowledge database or calling the interface of a third-party ecosystem platform to obtain second location service information or second vehicle use suggestions related to the target object.

[0093] In response to receiving the second query request, the server performs the following operations:

[0094] (2.1) Parse the second query request to obtain the second user's voice information;

[0095] Specifically, the server has short-term and long-term memory capabilities. When the server parses the second query request and obtains only voice-modality information, namely the second user voice information, the server can determine that it should use the context information of the current session to process the second query request, thereby realizing a multi-turn session.

[0096] (2.2) Obtain the memory data of the current session, wherein the memory data includes all interaction data between the vehicle and the server during the session;

[0097] Specifically, the interaction data includes all information exchanged between the vehicle and the server in the current session. This information may include, but is not limited to, user voice command records, images of the external environment, vehicle location information, and server response records. The user voice command records are the historical records of all voice commands given by the user in the current session, such as the first user voice information. The server response records are all information that the server returns to the vehicle during the session, including the name information of the target object, location service information, and vehicle usage suggestions.

[0098] The memory data provides the server with the contextual information needed to process the second query request. For example, if the user's second voice command is about a specific building mentioned earlier, the server can quickly locate the relevant information about that building through the memory data. By analyzing the memory data, the server can better understand the user's intent and needs, thereby providing more accurate and relevant service information. With the memory data, the server does not need to repeat previously completed calculations or queries and can respond quickly based on existing information, improving the overall efficiency of the system. The memory data is stored in the server's temporary storage, and the server ensures the security and privacy of this data, using it only to provide and improve in-vehicle services and not disclosing it to third parties.

[0099] (2.3) Generate a second query task based on the second user voice information and the memory data;

[0100] Specifically, the server generates a second query task by performing content mapping based on the second user voice information and the memory data. For example, in the aforementioned example, the second query task may include the name information of the target object, the second user voice information, the image of the vehicle's external environment, and the vehicle's current location information. The server then determines a corresponding preset expert model based on the second query task and calls the preset expert model to execute the second query task, querying a preset knowledge database or calling a third-party ecosystem platform interface to obtain the second location service information or second vehicle use suggestion related to the target object.

[0101] In one example, suppose a user, while driving, issues the following first voice command through the in-vehicle terminal: "What's that building to the left?" The server, based on the first voice command and the vehicle's location information, returns the answer: "That's a shopping mall in the city center." Subsequently, the user issues a second voice command, "I want to know if there's a parking lot there," which the in-vehicle terminal forwards to the server.

[0102] The server extracts information related to the first query from the memory data, including: building-related feature information with a probability of 0.85 (referring to the shopping mall), the first user's voice information ("What is that building to the left?"), and images of the external environment and the vehicle's current location. The server maps this information to a specific second query task, which is used to determine whether the downtown shopping mall has parking.
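The mapping from memory data and the second user voice information to a second query task can be sketched as follows. This is a sketch under stated assumptions: the `SessionMemory` layout, the `target_name` key, and the keyword-to-intent rules are hypothetical, since this application does not describe how the content mapping is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    # Hypothetical container for the interaction data of one session.
    voice_commands: list = field(default_factory=list)  # user voice command records
    responses: list = field(default_factory=list)       # server response records
    image_ref: str = ""                                  # reference to the stored image
    location: tuple = (0.0, 0.0)                         # (latitude, longitude)


def build_second_query_task(memory: SessionMemory, second_voice: str) -> dict:
    """Combine the follow-up utterance with session context into a query task."""
    # Resolve "there" / "this building" to the most recently named target object.
    target = next(
        (r["target_name"] for r in reversed(memory.responses) if "target_name" in r),
        None,
    )
    if target is None:
        raise ValueError("no target object found in session memory")

    # Assumed keyword rules standing in for real intent understanding.
    text = second_voice.lower()
    if "parking" in text:
        intent = "parking_info"
    elif "coffee" in text or "restaurant" in text:
        intent = "dining_info"
    else:
        intent = "general_info"

    return {
        "intent": intent,
        "target_name": target,
        "utterance": second_voice,
        "image_ref": memory.image_ref,
        "location": memory.location,
    }
```

Because the target name comes from memory, the follow-up question needs no new image or location data.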

[0103] (2.4) Determine the corresponding preset expert model according to the second query task, and call the preset expert model to execute the second query task, query the preset knowledge database or call the third-party ecosystem platform interface to obtain the second location service information or second vehicle use suggestion related to the target object.

[0104] Specifically, based on the generated second query task, the server determines which preset expert model to use. In this example, it might be a model specifically for handling location service information, such as a "parking information expert model." The server calls the "parking information expert model" to execute the query task, and this model performs the following operations:

[0105] The model can query a preset knowledge database containing detailed information about the shopping center, including parking lot locations, capacity, and current usage. Alternatively, it can call an interface of a third-party ecosystem platform, such as the city's intelligent transportation system, to obtain real-time parking information.

[0106] After the expert model finishes execution, the server returns the parking information it obtained to the vehicle terminal.
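Selecting and calling a preset expert model can be sketched as a simple lookup table. This is an illustration only: the registry, the in-memory `KNOWLEDGE_DB`, and the stubbed `third_party_parking_api` function are hypothetical, and trying the knowledge database first before falling back to the third-party interface is an assumed ordering (the description only says either source may be used).

```python
# Hypothetical preset knowledge database, keyed by target object name.
KNOWLEDGE_DB = {
    "downtown shopping mall": {
        "parking": {"capacity": 500, "entrance": "north side"},
    },
}


def third_party_parking_api(target_name: str) -> dict:
    """Stand-in for a third-party platform call; returns canned data here."""
    return {"capacity": None, "entrance": "unknown", "source": "third_party"}


def parking_expert(task: dict) -> dict:
    """Assumed 'parking information expert model'."""
    record = KNOWLEDGE_DB.get(task["target_name"], {}).get("parking")
    if record is not None:
        return {**record, "source": "knowledge_db"}
    # Fall back to the external interface when the database has no entry.
    return third_party_parking_api(task["target_name"])


# Registry mapping a query task's intent to its preset expert model.
EXPERTS = {"parking_info": parking_expert}


def run_query_task(task: dict) -> dict:
    expert = EXPERTS.get(task["intent"])
    if expert is None:
        raise KeyError(f"no expert model registered for intent {task['intent']!r}")
    return expert(task)
```

New expert models would be added by registering another function under a new intent key.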

[0107] In some embodiments, the method further includes:

[0108] Step S110: After prompting the user of the vehicle with the first vehicle use suggestion or the second vehicle use suggestion, wait for a preset time. If a third user voice message is received within the preset time, determine whether to execute the vehicle control function corresponding to the first vehicle use suggestion or the second vehicle use suggestion based on the third user voice message.

[0109] Specifically, step S110 is an interactive feedback and vehicle control process in which the in-vehicle system, after providing a vehicle usage suggestion, decides whether to execute the corresponding vehicle control function based on a further user instruction. After receiving the first or second vehicle usage suggestion from the server, the in-vehicle terminal informs the user via voice announcement, for example, "The XX Building is to your left." At the same time, it provides relevant service information or usage suggestions, such as, "The building is 200 meters from your destination. Parking is available at 10 yuan per hour. Do you want navigation to the parking lot?"

[0110] After the announcement is finished, the in-vehicle system enters a waiting state. The preset time may be a few seconds, which is enough for the user to understand the suggestion and react. The preset time is chosen based on user experience: if it is too short, the user does not have enough time to react; if it is too long, the system waits unnecessarily.

[0111] If the user responds to the suggestion within the preset time, the in-vehicle terminal receives the user's third voice command, parses it to obtain the third user voice information, and determines whether the user agrees to execute the suggested vehicle control function. If the user agrees (e.g., the user says "okay"), the in-vehicle system executes the corresponding vehicle control function, such as navigating to the parking lot. If the user disagrees (e.g., the user says "no"), the in-vehicle system does not execute the suggested vehicle control function and instead maintains the current driving state or waits for further instructions from the user.
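The wait-and-confirm behavior of step S110 can be sketched with a blocking queue and a timeout. This is a minimal sketch: the queue stands in for the voice recognition output, and the five-second default and the set of affirmative words are assumptions rather than values given in this application.

```python
import queue

# Assumed set of replies that count as agreement.
AFFIRMATIVE = {"okay", "ok", "yes", "sure"}


def await_confirmation(voice_queue: "queue.Queue[str]", preset_seconds: float = 5.0) -> str:
    """Wait up to preset_seconds for a third voice command and classify it.

    Returns "execute" if the user agrees, "skip" if the user declines,
    and "timeout" if nothing arrives within the preset time.
    """
    try:
        reply = voice_queue.get(timeout=preset_seconds)
    except queue.Empty:
        return "timeout"
    return "execute" if reply.strip().lower() in AFFIRMATIVE else "skip"
```

A caller would start navigation only on "execute" and otherwise keep the current driving state.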

[0112] Referring to Figure 3, Embodiment 2 of this application proposes a method for perceiving the external environment of a vehicle, including the following steps:

[0113] Step S21: Obtain the vehicle-uploaded image of the external environment containing the target object, the vehicle's current location information, and the first user's voice information; the first user's voice information includes the target object's location information and indication information for obtaining the name of the target object.

[0114] Step S22: Determine the name information of the target object based on the external environment image, the current location information of the vehicle, and the first user voice information.

[0115] Step S23: Send the name information of the target object to the vehicle so that the vehicle can prompt the user with the name information of the target object.

[0116] In some embodiments, the first user voice information further includes: description information of the target object, the description information including at least one of the following: color information of the target object, type information of the target object, name information of surrounding objects of the target object, and type information of surrounding objects of the target object.

[0117] In some embodiments, step S22 includes:

[0118] Step S221: Extract features from the first user's voice information, the external environment image, and the vehicle's current location information to obtain a voice feature vector, an image feature vector, and a geographic location feature vector, respectively.

[0119] Step S222: The speech feature vector, image feature vector and geographic location feature vector are fused to obtain a fused feature vector, and a decision is made based on the fused feature vector to obtain the type and probability of each object in the vehicle exterior environment image;

[0120] Step S223: Generate a first query task based on the type and probability of each object;

[0121] Step S224: Determine the corresponding preset expert model based on the first query task, and call the preset expert model to execute the first query task, query the preset knowledge database or call the interface of a third-party ecosystem platform to obtain the name information of the target object.
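Steps S221 to S223 can be sketched as follows for a single candidate object. This is a sketch under stated assumptions: this application does not specify the fusion or decision method, so concatenation followed by a linear scoring layer and a softmax is a stand-in technique, and the weights, labels, and probability threshold are hypothetical.

```python
import math


def fuse(voice_vec, image_vec, geo_vec):
    """Assumed fusion: concatenate the three modality feature vectors."""
    return list(voice_vec) + list(image_vec) + list(geo_vec)


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decide(fused, weights, labels):
    """Score each object type with a linear layer, then normalize to probabilities."""
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return dict(zip(labels, softmax(scores)))


def first_query_task(type_probs, threshold=0.5):
    """Build a first query task from the most probable object type."""
    label, prob = max(type_probs.items(), key=lambda kv: kv[1])
    if prob < threshold:
        # Too uncertain to look anything up; an assumed fallback.
        return {"task": "ask_user_to_clarify"}
    return {"task": "lookup_name", "object_type": label, "probability": prob}
```

In a real system the feature vectors would come from trained encoders and the weights from a trained model; here they are supplied by the caller.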

[0122] In some embodiments, step S22 further includes:

[0123] Step S225: Obtain first location service information or first vehicle use suggestion related to the target object based on the name information of the target object, and send the first location service information or first vehicle use suggestion to the vehicle so that the vehicle can prompt the user with the first location service information or first vehicle use suggestion.

[0124] In some embodiments, the method further includes:

[0125] Step S241: Obtain second user voice information sent by the vehicle. The second user voice information includes instruction information for obtaining location service information or vehicle use suggestions related to the target object.

[0126] Step S242: Obtain the memory data of the current session, wherein the memory data includes all interaction data between the vehicle and the server during the session;

[0127] Step S243: Generate a second query task based on the second user voice information and the memory data;

[0128] Step S244: Determine the corresponding preset expert model according to the second query task, and call the preset expert model to execute the second query task, query the preset knowledge database or call the third-party ecosystem platform interface to obtain the second location service information or the second vehicle use suggestion related to the target object;

[0129] Step S245: Send the second location service information or the second vehicle use suggestion to the vehicle so that the vehicle can prompt the user with the second location service information or the second vehicle use suggestion.

[0130] It should be noted that the method described in Embodiment 2 corresponds to the method described in Embodiment 1. The specific principle of the server executing the above steps of the method described in Embodiment 2 has been disclosed in detail in the description of the method described in Embodiment 1. Therefore, the relevant technical details of the above steps of the method described in Embodiment 2 can be obtained and understood by referring to the description of the method described in Embodiment 1, and will not be repeated in this embodiment.

[0131] In some embodiments, the step of calling a third-party ecosystem platform interface to obtain the name information of the target object includes:

[0132] A request is generated based on the vehicle's current location information, and the request is sent to the third-party ecosystem platform interface.

[0133] The system receives point-of-interest (POI) information returned by the third-party ecosystem platform interface and obtains the name information of the target object based on the POI information; wherein the POI information is relevant information in the geographic information system corresponding to the location of the target object.

[0134] Specifically, Point of Interest (POI) information refers to information about a specific location in a Geographic Information System (GIS). This information about a specific location is of particular interest or importance to certain users or applications. POI information can include various types of data. Some common POIs include commercial facilities, public service venues, leisure and entertainment venues, and transportation facilities. Specific POI information includes name, location, type, address, contact information, business hours, user ratings, service information, etc.

[0135] In this embodiment, the server can obtain the name information of the target object by calling the interface of a third-party ecosystem platform. This process involves data exchange and integration with external services. The server first needs to generate a call request based on the vehicle's current location information (latitude and longitude coordinates). This request contains necessary information, such as the vehicle's location, the type of request (e.g., obtaining nearby points of interest), and other possible parameters (e.g., search radius, type of interest, etc.). The format of the request follows the API specification of the third-party ecosystem platform interface to ensure that the data can be correctly parsed and processed.

[0136] After generating the call request, the server sends it over the network to the API interface of the third-party ecosystem platform. This interface could be a map service, location information service, or any service that provides geographic information data. Upon receiving the call request through this interface, the third-party ecosystem platform queries relevant points of interest (POIs) in the geographic information system based on the request parameters. The POI information includes, for example, the name, address, type, user ratings, and business hours of each point of interest. After completing the query, the third-party ecosystem platform returns the POI information as a response to the server. Upon receiving the POI information from the third-party ecosystem platform, the server parses and processes it. The server can select the most relevant POI information based on the specific location of the target object. The server can also combine this information with the name information of the target object to provide richer and more accurate data. For example, if the target object is a building, the server can also return information such as the building's name, main functions, and entrance location.
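The request generation and POI selection described above can be sketched as follows. This is a hedged illustration: the request fields, the POI record layout, and the idea of picking the POI nearest to an estimated target position are assumptions for the example, and no real map service API is being modeled. The distance uses the standard haversine formula.

```python
import math


def build_poi_request(lat, lon, radius_m=300, poi_type=None):
    """Assemble a hypothetical nearby-POI request from the vehicle's location."""
    request = {
        "location": {"lat": lat, "lon": lon},
        "request_type": "nearby_poi",
        "radius_m": radius_m,
    }
    if poi_type:
        request["poi_type"] = poi_type
    return request


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def pick_target_name(pois, target_lat, target_lon):
    """Return the name of the POI closest to the estimated target position."""
    if not pois:
        return None
    nearest = min(
        pois,
        key=lambda p: haversine_m(p["lat"], p["lon"], target_lat, target_lon),
    )
    return nearest["name"]
```

How the target position is estimated from the image and the spoken direction is left out of this sketch.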

[0137] Embodiment 3 of this application provides an in-vehicle terminal, including a module for executing the external environment perception method described in Embodiment 1. This module can be a software module, a hardware module, or a combination of software and hardware.

[0138] Embodiment 4 of this application provides a server, including a module for executing the vehicle external environment perception method described in Embodiment 2. The module can be a software module, a hardware module, or a combination of software and hardware.

[0139] Embodiment 5 of this application provides a computer program product, including computer program instructions that instruct a computer device to perform an operation corresponding to the method described in Embodiment 1.

[0140] Specifically, the computer program product includes a series of computer program instructions that instruct a computer device to execute the vehicle external environment perception method described in Embodiment 1. These instructions are code written in a computer program that defines how to perform specific operations. In this embodiment, these instructions are used to execute the vehicle external environment perception method of Embodiment 1 described above.

[0141] These program instructions are designed to be loaded onto a computer device and to instruct the device to perform specific operations, which refer to the various steps in the vehicle external environment perception method described in Embodiment 1 above.

[0142] In this way, the computer program product provides a complete software solution that can run on various computer devices to implement the vehicle external environment perception method of Embodiment 1 above.

[0143] Embodiment 6 of this application provides a computer program product, including computer program instructions that instruct a computer device to perform an operation corresponding to the method described in Embodiment 2.

[0144] Specifically, the computer program product includes a series of computer program instructions that can instruct a computer device to execute the vehicle external environment perception method described in Embodiment 2. These instructions are code written in a computer program that defines how to perform specific operations. In this embodiment, these instructions are used to execute the vehicle external environment perception method of Embodiment 2 described above.

[0145] These program instructions are designed to be loaded onto a computer device and to instruct the device to perform specific operations, which refer to the various steps in the vehicle external environment perception method described in Embodiment 2 above.

[0146] In this way, the computer program product provides a complete software solution that can run on various computer devices to implement the vehicle external environment perception method of Embodiment 2 above.

[0147] The various embodiments of this application have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical applications, or their technical improvements over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for perceiving the external environment of a vehicle, characterized in that, The method includes: The vehicle's voice recognition module acquires first user voice information, which includes the location information of the target object and indication information for obtaining the name of the target object. The vehicle's camera is controlled to acquire images of the external environment containing the target object; The vehicle's positioning module is controlled to obtain the vehicle's current location information; The name information of the target object is determined based on the external environment image, the current location information of the vehicle, and the first user's voice information. The name of the target object is displayed to the vehicle user. Determining the name information of the target object based on the external environment image, the vehicle's current location information, and the first user's voice information includes: The first user's voice information, the image of the external environment, and the vehicle's current location information are sent to the server to request the server to determine the name information of the target object; Receive the name information of the target object returned by the server; The server determines the name information of the target object in the following way: The first user's voice information, the external environment image, and the vehicle's current location information are respectively used to extract features to obtain a voice feature vector, an image feature vector, and a geographic location feature vector; The speech feature vector, image feature vector, and geographic location feature vector are fused to obtain a fused feature vector, and a decision is made based on the fused feature vector to obtain the type and probability of each object in the vehicle exterior environment image; A first query task is generated based on the type and probability of each object; The corresponding preset expert model is determined based on the first query task, and the preset expert model is called to execute the first query task, querying the preset knowledge database or calling the interface of a third-party ecosystem platform to obtain the name information of the target object.

2. The method as described in claim 1, characterized in that, The first user voice information further includes: description information of the target object, the description information including at least one of the following: color information of the target object, type information of the target object, name information of surrounding objects of the target object, and type information of surrounding objects of the target object.

3. The method according to claim 2, characterized in that, The method further includes: The system receives first location service information or first vehicle use suggestion related to the target object returned by the server, and prompts the user of the vehicle with the first location service information or first vehicle use suggestion; The first location service information or the first vehicle use suggestion is obtained by the server based on the name information of the target object.

4. The method according to claim 3, characterized in that, The method further includes: The vehicle's voice recognition module acquires second user voice information, which includes instructions for obtaining location service information or vehicle usage suggestions related to the target object. The second user voice information is sent to the server to request the server to determine location service information or vehicle usage suggestions related to the target object; Receive the second location service information or second vehicle use suggestion returned by the server, and prompt the vehicle user with the second location service information or second vehicle use suggestion; The server determines the location service information or vehicle usage suggestions related to the target object in the following way: Obtain the memory data of the current session, which includes all interaction data between the vehicle and the server during the session; A second query task is generated based on the second user's voice information and the memory data; The corresponding preset expert model is determined based on the second query task, and the preset expert model is called to execute the second query task, querying the preset knowledge database or calling the interface of a third-party ecosystem platform to obtain second location service information or second vehicle use suggestions related to the target object.

5. The method according to claim 3 or 4, characterized in that, The method further includes: After prompting the vehicle user with the first or second vehicle usage suggestion, wait for a preset time. If a third user voice message is received within the preset time, determine whether to execute the vehicle control function corresponding to the first or second vehicle usage suggestion based on the third user voice message.

6. A method for perceiving the external environment of a vehicle, characterized in that, The method includes: The system acquires images of the vehicle's external environment containing the target object, the vehicle's current location information, and the first user's voice information uploaded by the vehicle. The first user's voice information includes the target object's orientation information and indication information for obtaining the name of the target object. The name information of the target object is determined based on the external environment image, the current location information of the vehicle, and the first user's voice information. The name information of the target object is sent to the vehicle so that the vehicle can prompt the user with the name information of the target object; Determining the name information of the target object based on the external environment image, the vehicle's current location information, and the first user's voice information includes: The first user's voice information, the external environment image, and the vehicle's current location information are respectively used to extract features to obtain a voice feature vector, an image feature vector, and a geographic location feature vector; The speech feature vector, image feature vector, and geographic location feature vector are fused to obtain a fused feature vector, and a decision is made based on the fused feature vector to obtain the type and probability of each object in the vehicle exterior environment image; A first query task is generated based on the type and probability of each object; The corresponding preset expert model is determined based on the first query task, and the preset expert model is called to execute the first query task, querying the preset knowledge database or calling the interface of a third-party ecosystem platform to obtain the name information of the target object.

7. The method as described in claim 6, characterized in that, The first user voice information further includes: description information of the target object, the description information including at least one of the following: color information of the target object, type information of the target object, name information of surrounding objects of the target object, and type information of surrounding objects of the target object.

8. The method according to claim 6 or 7, characterized in that, The method further includes: Based on the name information of the target object, obtain the first location service information or the first vehicle use suggestion related to the target object, and send the first location service information or the first vehicle use suggestion to the vehicle so that the vehicle can prompt the user with the first location service information or the first vehicle use suggestion.

9. The method according to claim 8, characterized in that, The method further includes: Acquire second user voice information sent by the vehicle, the second user voice information containing instructions for obtaining location service information or vehicle use suggestions related to the target object; Obtain the memory data of the current session, which includes all interaction data between the vehicle and the server during the session; A second query task is generated based on the second user's voice information and the memory data; The corresponding preset expert model is determined according to the second query task, and the preset expert model is called to execute the second query task, query the preset knowledge database or call the third-party ecosystem platform interface to obtain the second location service information or the second vehicle use suggestion related to the target object; The second location service information or second vehicle use suggestion is sent to the vehicle so that the vehicle can prompt the user with the second location service information or second vehicle use suggestion.

10. The method according to claim 9, characterized in that, The process of obtaining the name information of the target object by calling the interface of a third-party ecosystem platform includes: A request is generated based on the vehicle's current location information, and the request is sent to the third-party ecosystem platform interface. The system receives point-of-interest (POI) information returned by the third-party ecosystem platform interface and obtains the name information of the target object based on the POI information; wherein the POI information is relevant information in the geographic information system corresponding to the location of the target object.

11. A vehicle-mounted terminal, characterized in that, Includes a module for performing the vehicle external environment perception method according to any one of claims 1 to 5.

12. A server, characterized in that, Includes a module for performing the vehicle external environment perception method according to any one of claims 6 to 10.

13. A computer program product, characterized in that, It includes computer program instructions that instruct a computer device to perform an operation corresponding to the method as described in any one of claims 1 to 10.

Citation Information

Patent Citations

  • User instruction identification method and device, storage medium and equipment (CN114299943A)

  • Voice question answering method and device in driving scene and vehicle-mounted terminal (CN115312061A)