Voice instruction interaction method and device, electronic equipment and storage medium

By converting audio information into text and combining it with multimodal model processing, instructions matching user intent are generated, solving the problem of customized voice in voice interaction, realizing end-to-end "what you see is what you can say" across all scenarios, and improving the efficiency and accuracy of voice interaction.

CN117351952B · Active · Publication Date: 2026-06-23 · CHONGQING CHANGAN TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHONGQING CHANGAN TECH CO LTD
Filing Date: 2023-09-19
Publication Date: 2026-06-23

IPC: G10L15/22; G10L15/26

AI Tagging

Application Domain

Speech recognition

Technology Topics

Driver/operator In vehicle

Technical Efficacy Phrases

Accurately understand the purpose of answering questions

Accurate insights into search intent

Abstract

The application relates to a voice instruction interaction method and device, electronic equipment and a storage medium, and relates to the technical field of full-scene voice interaction. The method comprises the following steps: receiving audio information, and converting the audio information into first text information; in the case that the first text information meets a first condition, obtaining second text information matched with the first text information from a database; and generating a reply instruction matched with a user intention according to the audio information, the first text information, the second text information and interface information of a current interface of a vehicle terminal. Therefore, the generalization ability of voice interaction can be effectively improved, and the interaction intention of a driver can be accurately understood.

Description

Technical Field

[0001] This application relates to the field of vehicle technology, and more particularly to the field of full-scene voice interaction technology, specifically to a voice command interaction method, device, electronic device and storage medium.

Background Technology

[0002] Because drivers need to focus on road conditions while driving, voice interaction has become a standard core function of smart cockpits. During voice interaction, the in-vehicle terminal typically matches the text information converted from the driver's voice input, then obtains and executes the operation command based on the matching result.

[0003] However, currently, when conducting voice interaction, customized voice commands are often required for long, continuous speech or specific phrases in order to match the corresponding instructions and realize the driver's operational intentions. This results in weak generalization ability of voice interaction and an inability to accurately understand the driver's operational intentions.

Summary of the Invention

[0004] This application provides a voice command interaction method, apparatus, electronic device, and storage medium to solve the technical problem in related technologies where customized voice is often required for interaction with continuous long speech or special expressions. The technical solution of this application is as follows:

[0005] According to a first aspect of this application, a voice command interaction method is provided, comprising:

[0006] The system receives audio information and converts it into first text information. The audio information reflects the user's intent. If the first text information meets a first condition, the system retrieves second text information that matches the first text information from the database. The first condition identifies the first text information as either question-and-answer information or query information. Based on the audio information, the first text information, the second text information, and the interface information of the current interface of the vehicle terminal, the system generates a response instruction that matches the user's intent.

[0007] Through the aforementioned technical means, this application can convert audio information into first text information after receiving it. If the first text information is of the question-and-answer type or the query type, the method combines the audio information, the first text information, the second text information matching the first text information, and the interface information of the current interface of the vehicle terminal to generate a response instruction matching the user's intent, without requiring customized voice commands. Therefore, the voice command interaction method provided by the embodiments of this application can effectively improve the generalization ability of voice interaction and accurately understand the driver's question-and-answer intent or query intent.

[0008] In one possible implementation, the method further includes:

[0009] If the first text information satisfies the second condition, the audio information, the first text information, and the interface information of the current interface of the vehicle terminal are input into the pre-trained multimodal model to obtain the operation instructions that match the user's intention; the second condition is used to identify the first text information as control information; the operation instructions are used to instruct the corresponding control operation to be performed on the target controlled object in the interface information.

[0010] Through the above technical solution, this application can, when the first text information is control information, combine audio information, the first text information, and the interface information of the current interface of the vehicle terminal to obtain the operation instruction for performing the corresponding control operation on the target controlled object in the interface information. Therefore, the target controlled object can be directly matched and identified from the interface information of the current interface, and the operation instruction for performing the control operation on the target controlled object can be obtained, which can effectively improve the efficiency of voice interaction and quickly understand the driver's operation intention.
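The two branches described in [0006] and [0009] can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the names `TextKind`, `Request`, and `handle`, and the idea of passing the classifier, the retriever, and the multimodal model in as plain callables, are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class TextKind(Enum):
    QA_OR_QUERY = auto()  # first condition: question-and-answer or query information
    CONTROL = auto()      # second condition: control information
    OTHER = auto()        # neither condition is met


@dataclass
class Request:
    audio: bytes      # audio information received from the vehicle terminal
    first_text: str   # first text information converted from the audio
    interface: bytes  # image of the current interface of the vehicle terminal


def handle(
    req: Request,
    classify: Callable[[str], TextKind],
    retrieve: Callable[[str], list[str]],
    model: Callable[..., str],
) -> Optional[str]:
    """Route a request to the response branch or the control branch."""
    kind = classify(req.first_text)
    if kind is TextKind.QA_OR_QUERY:
        # First condition: look up second text information, then call the model.
        second_text = retrieve(req.first_text)
        return model(req.audio, req.interface, req.first_text, second_text)
    if kind is TextKind.CONTROL:
        # Second condition: no retrieval; the model sees audio, text, and interface only.
        return model(req.audio, req.interface, req.first_text, None)
    return None  # behaviour for other inputs is not specified in the text
```

Passing the three collaborators as arguments keeps the sketch independent of any particular speech, retrieval, or model library.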

[0011] In one possible implementation, a response instruction matching the user's intent is generated based on audio information, first text information, second text information, and interface information of the current interface of the vehicle terminal, including:

[0012] Audio information, interface information, first text information, and second text information are input into a pre-trained multimodal model to generate response instructions that match the user's intent.

[0013] The above technical solution can integrate multimodal features such as interface information (i.e., image information), audio information, first text information, and second text information from the current interface, and combine them with a pre-trained multimodal model to generate instructions that match the user's intent. Therefore, it can achieve end-to-end "see-and-say" functionality across all scenarios.

[0014] In one possible implementation, the multimodal model includes an image processing model, an audio processing model, and a generative language model; audio information, interface information, first text information, and second text information are input into the pre-trained multimodal model to generate response instructions that match the user's intent, including:

[0015] The audio information is input into the audio processing model to obtain the audio feature information corresponding to the audio information; the interface information is input into the image processing model to obtain the interface feature information corresponding to the interface information; the audio feature information, interface feature information, first text information and second text information are input into the generative language model to generate a response instruction that matches the user's intent.

[0016] Through the above technical solution, the multimodal model can include an image processing model, an audio processing model, and a generative language model. Therefore, when processing multimodal information such as interface information, audio information, first text information, and second text information, the interface information can be processed through the image processing model, the audio information can be processed through the audio processing model, and the multimodal features can be fused through the generative language model. Thus, on the basis of achieving end-to-end "what you see is what you can say" functionality across all scenarios, the accuracy of understanding the driver's operating intentions can be further improved, thereby accurately instructing the vehicle terminal to execute the corresponding functional events.
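The data flow in [0015] (audio model, image model, then a generative language model that receives both feature sets together with the two texts) can be sketched as follows. The class name `MultimodalModel`, the `Features` alias, and the callable signatures are assumptions made for illustration; the application does not specify model architectures or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Features = Sequence[float]  # hypothetical feature representation


@dataclass
class MultimodalModel:
    audio_model: Callable[[bytes], Features]  # audio processing model
    image_model: Callable[[bytes], Features]  # image processing model
    language_model: Callable[[Features, Features, str, Sequence[str]], str]

    def generate(
        self,
        audio: bytes,
        interface: bytes,
        first_text: str,
        second_text: Sequence[str],
    ) -> str:
        """Extract per-modality features, then let the language model fuse them."""
        audio_features = self.audio_model(audio)
        interface_features = self.image_model(interface)
        return self.language_model(
            audio_features, interface_features, first_text, second_text
        )
```

In this sketch the two encoders run independently, so either could be swapped or retrained without touching the other, which matches the separate training procedures listed for Figures 6 and 7.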

[0017] In one possible implementation, the database includes a vector database and a text database; retrieving second text information matching the first text information from the database includes:

[0018] Determine the first text vector corresponding to the first text information; search for the second text vector that matches the first text vector from the vector database; obtain the second text information corresponding to the second text vector from the text database.

[0019] Through the above technical solution, second text information matching the first text information can be obtained from a preset text database, so that the second text information can be combined to determine the instruction matching the user's intention, effectively improving the accuracy of understanding the driver's operating intention, and thus accurately instructing the vehicle terminal to execute the corresponding functional event.

[0020] In one possible implementation, the user intent is the user's query intent; the first text information includes one or more keywords; the response instruction includes information related to one or more keywords.

[0021] In one possible implementation, the user intent is the user's vehicle control intent; the first text information includes one or more keywords; the operation instruction includes one or more function events corresponding to the keywords.

[0022] According to a second aspect of this application, a voice command interaction device is provided, the device comprising: a conversion unit, an acquisition unit, and a generation unit, wherein:

[0023] The conversion unit is used to receive audio information and convert it into first text information; the audio information is used to reflect the user's intent.

[0024] The acquisition unit is used to acquire second text information that matches the first text information from the database when the first text information satisfies a first condition; the first condition is used to identify whether the first text information is question-and-answer information or query information.

[0025] The generation unit is used to generate a response instruction that matches the user's intent based on the audio information, the first text information, the second text information, and the interface information of the current interface of the vehicle terminal.

[0026] In one possible implementation, the device further includes a processing unit, which is specifically used for:

[0027] If the first text information satisfies the second condition, the audio information, the first text information, and the interface information of the current interface of the vehicle terminal are input into the pre-trained multimodal model to obtain the operation instructions that match the user's intention; the second condition is used to identify the first text information as control information; the operation instructions are used to instruct the corresponding control operation to be performed on the target controlled object in the interface information.

[0028] In one possible implementation, the generating unit is specifically used for:

[0029] Audio information, interface information, first text information, and second text information are input into a pre-trained multimodal model to generate response instructions that match the user's intent.

[0030] In one possible implementation, the multimodal model includes an image processing model, an audio processing model, and a large generative language model; the generation unit is specifically used for:

[0031] The audio information is input into the audio processing model to obtain the audio feature information corresponding to the audio information; the interface information is input into the image processing model to obtain the interface feature information corresponding to the interface information; the audio feature information, interface feature information, first text information and second text information are input into the generative language model to generate a response instruction that matches the user's intent.

[0032] In one possible implementation, the database includes a vector database and a text database; the acquisition unit is specifically used for:

[0033] Determine the first text vector corresponding to the first text information; search for the second text vector that matches the first text vector from the vector database; obtain the second text information corresponding to the second text vector from the text database.

[0034] In one possible implementation, the user intent is the user's query intent; the first text information includes one or more keywords; the response instruction includes information related to one or more keywords.

[0035] In one possible implementation, the user intent is the user's vehicle control intent; the first text information includes one or more keywords; the operation instruction includes one or more function events corresponding to the keywords.

[0036] According to a third aspect provided in this application, an electronic device is provided, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute instructions to implement the voice command interaction method of the first aspect and any possible implementation thereof.

[0037] According to the fourth aspect provided in this application, a computer-readable storage medium is provided that, when the instructions in the computer-readable storage medium are executed by the processor of an electronic device, enables the electronic device to perform the voice command interaction method of the first aspect and any possible implementation thereof.

[0038] According to the fifth aspect provided in this application, a computer program product is provided, the computer program product including computer instructions, which, when executed on an electronic device, cause the electronic device to execute the voice command interaction method of the first aspect and any possible implementation thereof.

[0039] Therefore, the above-mentioned technical features of this application have the following beneficial effects:

[0040] (1) Upon receiving audio information, the audio information is converted into first text information. If the first text information is of the question-and-answer type or the query type, a response instruction matching the user's intent is generated by combining the audio information, the first text information, the second text information matching the first text information, and the interface information of the current interface of the vehicle terminal, without the need for customized voice commands. Therefore, the voice command interaction method provided by the embodiments of this application can effectively improve the generalization ability of voice interaction and accurately understand the driver's question-and-answer intent or query intent.

[0041] (2) When the first text information is control information, the operation instructions for performing corresponding control operations on the target controlled object in the interface information are obtained by combining the audio information, the first text information and the interface information of the current interface of the vehicle terminal. Therefore, the target controlled object can be directly matched and extracted from the interface information of the current interface, and the operation instructions for performing control operations on the target controlled object can be obtained, which can effectively improve the efficiency of voice interaction and quickly understand the driver's operation intentions.

[0042] (3) Multimodal features such as the interface information of the current interface (i.e., image information), audio information, first text information, and second text information can be integrated and combined with a pre-trained multimodal model to generate instructions that match the user's intent. Therefore, end-to-end "what you see is what you can say" functionality can be achieved across all scenarios.

[0043] (4) When processing multimodal information such as interface information, audio information, first text information and second text information, the interface information can be processed by the image processing model, the audio information can be processed by the audio processing model, and the multimodal features can be fused by the generative language model. Therefore, on the basis of realizing end-to-end "what you see is what you can say" functionality across all scenarios, the accuracy of understanding the driver's operation intention can be further improved, and the vehicle terminal can then be accurately instructed to execute the corresponding functional events.

[0044] (5) Second text information that matches the first text information can be obtained from the preset text database, so that the second text information can be combined to determine the instruction matching the user's intent, effectively improving the accuracy of understanding the driver's search intent or query intent, and thus accurately instructing the vehicle terminal to execute the corresponding functional event.

[0045] It should be noted that the technical effects of any of the implementation methods in aspects two through five can be found in the technical effects of the corresponding implementation methods in aspect one, and will not be repeated here.

[0046] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0047] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application, and do not constitute an undue limitation of this application.

[0048] Figure 1 is a flowchart illustrating a voice command interaction method according to an exemplary embodiment;

[0049] Figure 2 is a structural diagram illustrating obtaining second text information according to an exemplary embodiment;

[0050] Figure 3 is a structural diagram illustrating determining a vector database according to an exemplary embodiment;

[0051] Figure 4 is an interface diagram of an in-vehicle terminal according to an exemplary embodiment;

[0052] Figure 5 is a structural diagram illustrating a voice command interaction method according to an exemplary embodiment;

[0053] Figure 6 is a flowchart illustrating a training method for an audio processing model according to an exemplary embodiment;

[0054] Figure 7 is a flowchart illustrating a training method for an image processing model according to an exemplary embodiment;

[0055] Figure 8 is a block diagram illustrating a voice command interaction device according to an exemplary embodiment;

[0056] Figure 9 is a block diagram illustrating an electronic device according to an exemplary embodiment.

Detailed Implementation

[0057] To enable those skilled in the art to better understand the technical solutions of this application, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings.

[0058] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0059] Because drivers need to focus on road conditions while driving, voice interaction has become a standard core function of smart cockpits. Currently, voice interaction is mainly achieved through a combination of two approaches: "what you see is what you can say" and "semantic recognition." In the "what you see is what you can say" approach, human-machine interface (HMI) controls are recognized and controlled through their displayed text or through pre-added descriptive text. In the "semantic recognition" approach, semantic recognition is typically achieved based on mainstream industry neural network algorithms. These algorithms achieve satisfactory accuracy in quiet environments and meet business needs.

[0060] Currently, the following schemes are commonly used in the semantic recognition of audio information input by drivers:

[0061] Option 1: Acquire user control voice; perform visible-and-speakable recognition on the user control voice to generate a recognition result; perform natural language understanding processing on the user control voice to generate a processing result; determine the target control command based on the recognition result and the processing result; execute the control operation corresponding to the target control command. If the user control voice is recognized as a trigger command for a control on the interface currently displayed on the vehicle's infotainment screen, execute that trigger command. If the user control voice is not a trigger command for a control on the interface currently displayed on the vehicle's infotainment screen, the user's intent can be determined based on the processing result of the natural language understanding method, and the corresponding control operation can be executed.

[0062] Option 2: Preprocess the acquired image and text data; input the preprocessed image information and text data into the image-text model to extract different modal features; align the feature data of different modalities into the same semantic space through linear mapping; input the aligned image and text features into the deep joint autoencoder model to obtain image-text multimodal features, and perform cross-attention on each layer of the deep joint autoencoder model.

[0063] As the above solutions demonstrate, existing technologies process user voice by matching the text information corresponding to the driver's input, then obtaining and executing operation instructions based on the matching results. However, current voice interaction often requires customized voice commands for long, continuous speech or specific phrases to match the corresponding instructions and realize the driver's intentions. This results in weak generalization ability of voice interaction and an inability to accurately understand the driver's operational intentions.

[0064] To address the technical problem in existing technologies where customized voice prompts are often required for interactions with continuous long speech or specific statements, this application provides a voice command interaction method. This method, upon receiving audio information, converts the audio information into first text information. If the first text information is a question-and-answer or query-type message, it combines the audio information, the first text information, second text information matching the first text information, and the interface information of the current interface of the in-vehicle terminal to generate a response command matching the user's intent, without requiring customized voice prompts for interaction. Therefore, the voice command interaction method provided by this application can effectively improve the generalization ability of voice interaction and accurately understand the driver's operational intentions.

[0065] It should be noted that the voice command interaction method provided in this application embodiment can be executed by a server or by an in-vehicle terminal. The following description uses a server as an example to illustrate the voice command interaction method provided in this application embodiment.

[0066] Figure 1 is a flowchart illustrating a voice command interaction method according to an exemplary embodiment. As shown in Figure 1, the voice command interaction method includes the following steps:

[0067] S101, receive audio information and convert the audio information into first text information.

[0068] Audio information is used to reflect user intent, such as query intent, image recognition intent, or vehicle control intent.

[0069] Specifically, after receiving the user's voice input, the vehicle terminal can determine the corresponding audio information and then send the audio information to the server.

[0070] S102, if the first text information satisfies the first condition, retrieve the second text information that matches the first text information from the database.

[0071] The first condition is used to identify whether the first text information is question-and-answer type or query type. For example, when the first text information is "Search for the performance of ** car", it can be determined that the first text information is query type. When the first text information is "Is the performance of ** car *******?", it can be determined that the first text information is question-and-answer type.

[0072] In one alternative implementation, the database may include a vector database and a text database. After converting the audio information into first text information and determining that the first text information is question-and-answer type information or query type information, the server may first determine the first text vector corresponding to the first text information, search for a second text vector that matches the first text vector in the vector database, and then obtain the second text information corresponding to the second text vector from the text database.

[0073] In this embodiment of the application, the text database may include, but is not limited to, information such as person information, product information, information source information, product documents and product manuals. This embodiment of the application does not limit the content of the text database.

[0074] Specifically, after receiving audio information, the server can convert it into text information to obtain first text information. The server then identifies multiple keywords contained within the first text information. If these keywords include pre-set question-and-answer keywords, the first text information is determined to be question-and-answer type information; if they include pre-set query keywords, it is determined to be query type information. After determining whether the first text information is question-and-answer or query type information, the server can input it into a pre-trained language model to obtain the corresponding first text vector.

[0075] After obtaining the first text vector, the server can search for multiple second text vectors that match the first text vector in the vector database, and obtain the text information corresponding to the multiple second text vectors from a preset text database according to the multiple second text vectors, so as to obtain the second text information.

[0076] In the embodiments of the present application, the question-and-answer type keywords may include but are not limited to the keywords "ma", "ne", "ba", etc., and the query type keywords may include but are not limited to the keywords "query", "search", "reasoning", "recognition", etc. The embodiments of the present application do not make specific limitations on the question-and-answer type keywords and the query type keywords.
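The keyword check described in [0074] and [0076] can be sketched as a set intersection. The keyword lists below are copied from the translated text; the tokenizer `extract_keywords`, the label strings, and the choice to test question-and-answer keywords before query keywords are assumptions made for illustration, since the application does not state how keywords are identified or which type wins when both match.

```python
import re
from typing import Optional

# Keyword lists as given in the translated text; illustrative, not exhaustive.
QA_KEYWORDS = {"ma", "ne", "ba"}
QUERY_KEYWORDS = {"query", "search", "reasoning", "recognition"}


def extract_keywords(text: str) -> set[str]:
    """Hypothetical keyword extractor: lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify_first_text(text: str) -> Optional[str]:
    """Return the information type of the first text, or None if no keyword matches."""
    keywords = extract_keywords(text)
    if keywords & QA_KEYWORDS:
        return "question_and_answer"
    if keywords & QUERY_KEYWORDS:
        return "query"
    return None  # the first condition is not met


print(classify_first_text("Search for the performance of the car"))  # query
print(classify_first_text("open the window"))                        # None
```

A production system would need a tokenizer suited to the input language; the regular expression here only handles space-separated Latin text.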

[0077] In the embodiments of the present application, the language model can be determined according to the actual situation. For example, the language model may include but is not limited to a deep learning model, such as the SBERT (Sentence Bidirectional Encoder Representation from Transformers) model. The embodiments of the present application do not make limitations on this.

[0078] For example, Figure 2 is a structural diagram illustrating obtaining second text information according to an exemplary embodiment. As shown in Figure 2, after the server converts the audio information into the first text information and determines that the first text information is question-and-answer type information or query type information, the server can convert the first text information into a first text vector through the language model 201, and then search for n second text vectors that match the first text vector in the vector database 202, where n is an integer greater than 1.

[0079] After obtaining the n second text vectors, the text information corresponding to the n second text vectors can be obtained from the text database 203 to obtain the second text information.
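The two-stage lookup of Figure 2 (match vectors first, then fetch the corresponding text) can be sketched with two dictionaries that share keys. Cosine similarity is an assumption here: the application only says the second text vectors "match" the first text vector and does not name a metric. The function names and the in-memory dictionaries standing in for the vector database 202 and text database 203 are likewise hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_second_text(
    first_vector: Sequence[float],
    vector_db: dict[str, Sequence[float]],
    text_db: dict[str, str],
    n: int = 2,
) -> list[str]:
    """Find the n closest stored vectors, then look up their text by shared key."""
    ranked = sorted(
        vector_db,
        key=lambda key: cosine(first_vector, vector_db[key]),
        reverse=True,
    )
    return [text_db[key] for key in ranked[:n]]


vector_db = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [0.7, 0.7]}
text_db = {"s1": "A", "s2": "B", "s3": "C"}
print(retrieve_second_text([1.0, 0.1], vector_db, text_db))  # ['A', 'C']
```

The linear scan is adequate for a sketch; a real vector database would use an approximate nearest-neighbour index instead.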

[0080] In an optional implementation manner, before searching for the second text vector that matches the first text vector in the vector database, the text information in the text database can also be subjected to sentence vector embedding to obtain corresponding feature vectors, and a vector database is generated based on the feature vectors.

[0081] In the embodiments of the present application, the text database may store structured data (such as table data), or may store unstructured data (such as text data). The embodiments of the present application do not make limitations on the information stored in the text database.

[0082] Specifically, in some embodiments, before generating a vector database based on the feature vectors corresponding to the text information in the text database, the information stored in the text database, such as personnel information, product information, source information, product documents, and product manuals, can be cleaned using an Extract Transform Load (ETL) tool to obtain regularized text information.

[0083] After obtaining the regularized text information, the server can segment the regularized information to obtain multiple segmented text information. Then, using a preset language model, the multiple segmented text information are sequentially converted into segmented text vectors, and each segmented text vector is stored in a vector database.

[0084] For example, Figure 3 is a structural diagram illustrating how the vector database is determined according to an exemplary embodiment. As shown in Figure 3, the server can convert structured data (such as dictionaries, source information, knowledge bases, etc.) and unstructured data (such as product documents, function lists, etc.) contained in the text database 203 into regularized text information through ETL, and then segment the regularized text information to obtain multiple pieces of segmented text information. Then, the language model 201 sequentially converts the pieces of segmented text information into segmented text vectors (embedding1, embedding2, embedding3, etc.), and stores each segmented text vector in the vector database 202.
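The clean, segment, embed, and store sequence of [0082] to [0084] might look like the sketch below. Everything here is illustrative: `clean` is a trivial stand-in for an ETL tool, sentence splitting is only one possible segmentation, and `embed` is a toy hashed bag-of-words function used in place of language model 201, whose actual form the source does not specify.

```python
import hashlib
import re

DIM = 16  # assumed embedding size for this toy example

def clean(raw_text):
    """Stand-in for the ETL step: collapse whitespace and trim."""
    return re.sub(r"\s+", " ", raw_text).strip()

def segment(text):
    """Split regularized text into sentence-level segments."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p]

def embed(segment_text):
    """Toy hashed bag-of-words vector; a real system would call a language model."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", segment_text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    return vec

def build_vector_db(text_db):
    """Map (document id, segment index) to (segment text, segment vector)."""
    vector_db = {}
    for doc_id, raw in text_db.items():
        for idx, seg in enumerate(segment(clean(raw))):
            vector_db[(doc_id, idx)] = (seg, embed(seg))
    return vector_db

# Hypothetical text database entry with irregular whitespace.
text_db = {
    "manual": "Press  the button to open the window.\nHold it to close the window.",
}
vector_db = build_vector_db(text_db)
```

Keeping the segment text next to its vector is a convenience of this sketch; the application keeps text in the text database and vectors in the vector database.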

[0085] The above technical solution can embed information in the text database to obtain the corresponding feature vectors and save them to the vector database so that the second text vector corresponding to the first text vector can be quickly obtained in the future, thereby effectively improving the efficiency and accuracy of voice interaction.

[0086] S103, based on the audio information, the first text information, the second text information, and the interface information of the current interface of the vehicle terminal, generate a response instruction that matches the user's intent.

[0087] The response instruction that matches the user's intent is used to instruct the vehicle terminal to play or display the content contained in the response instruction, such as playing or displaying the text content contained in the response instruction.

[0088] In this embodiment, the interface information of the vehicle terminal's current interface can be the image information of the current interface sent to the server by the vehicle terminal after detecting a change in the current interface, or it can be the image information of the current interface obtained by the server from the vehicle terminal. This embodiment does not specifically limit the method of obtaining the interface information of the vehicle terminal's current interface.

[0089] In one optional implementation, after obtaining the second text information matching the first text information in S102, the server can input the audio information, interface information, first text information and second text information into a pre-trained multimodal model to generate a response instruction that matches the user's intent.

[0090] Multimodal models are a product of multimodal technology. Multimodal technology aligns the feature spaces of multiple modalities, enabling the fusion of information from various modalities, including text, images, video, and audio. For example, multimodal technology can be used to fuse audio and text information; or, it can be used to fuse audio, image, and text information.

[0091] Multimodal technologies have played a significant role in various fields, such as natural language generation, visual question answering, and intelligent recommendation. The following sections will introduce multimodal technologies in the areas of natural language generation, visual question answering, and intelligent recommendation, respectively.

[0092] (I) Natural Language Generation

[0093] In the field of natural language generation, multimodal techniques can be applied to image text descriptions, video text descriptions, and text descriptions that combine images and text.

[0094] In image-text description, multimodal models can be used to recognize and describe images, generating natural language text associated with the image. For example, given an image of a cat, it can generate "A black cat is looking at the camera with big eyes."

[0095] In video text description, multimodal models can be used to integrate image and audio information to generate natural language text related to the video. For example, given a video clip, a description like "a girl is picking flowers in the forest" can be generated.

[0096] In text descriptions that combine text and images, multimodal models can be used to fuse textual and image information, generating more specific and vivid natural language descriptions. For example, in scenarios such as product recommendations, product images can be combined with text describing the product to generate more engaging and specific recommendation text.

[0097] (II) Visual Question Answering

[0098] In visual question answering, multimodal technology can take into account the information contained in the image, such as objects, colors, textures and shapes, as well as the context and perspective of the question, so as to answer questions more accurately, generate descriptions and understand events.

[0099] (III) Intelligent Recommendation

[0100] In terms of intelligent recommendations, multimodal technology can better understand users' interests and needs, and provide more personalized recommendation services.

[0101] In this application embodiment, the multimodal model may include, but is not limited to, the GPT-4 model, the CLIP model, and the BLIP-2 model. This application embodiment does not specifically limit the multimodal model.

[0102] Specifically, in some embodiments, after the server inputs audio information, interface information of the current interface of the vehicle terminal, first text information, and second text information into a pre-trained multimodal model, the multimodal model can sequentially determine the audio feature vector corresponding to the audio information, the interface feature vector corresponding to the interface information, the first text feature vector corresponding to the first text information, and the second text feature vector corresponding to the second text information. The model then performs fusion processing on the audio feature vector, interface feature vector (or image feature vector), first text feature vector, and second text feature vector, outputting a target protocol in a preset format. After obtaining the target protocol, the server can send it to the vehicle terminal, which can then execute corresponding instructions according to the target protocol.

[0103] In this embodiment of the application, the preset format can be set according to the actual situation. For example, the preset format can be set to JSON format so that the target protocol contains the domain concept and the strong assignment intent (intent or slot).

[0104] For example, in one embodiment, assuming the user inputs the audio "search for the person information in the top left corner", the output target protocol can be: {"domain": "viewcmd", "slot": {"insType": "picture_describe", "results": "***"}}. Here, "domain": "viewcmd" represents the image control domain; "slot": {"insType": "picture_describe", "results": "***"} indicates that the person information in the top left corner is: ***. Here, *** represents the content of the searched person information in the top left corner.

[0105] In another embodiment, assuming the user input audio is "Translate the subtitles in the image," the output target protocol can be: {"domain": "viewcmd", "slot": {"insType": "picture_translate", "results": "***"}}. Here, "slot": {"insType": "picture_translate", "results": "***"} indicates that the image subtitles are translated to: ***. Here, *** represents the translated content of the subtitles in the image.
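On the receiving side, a target protocol of the shape shown in [0104] and [0105] could be parsed and checked roughly as follows. The field names `domain`, `slot`, `insType`, and `results` come from the examples above; the function name `parse_target_protocol` and the validation rules are assumptions made for illustration.

```python
import json

def parse_target_protocol(raw):
    """Parse a target protocol string and check the fields used in the examples."""
    protocol = json.loads(raw)
    if not isinstance(protocol, dict):
        raise ValueError("target protocol must be a JSON object")
    domain = protocol.get("domain")
    slot = protocol.get("slot")
    if not isinstance(domain, str) or not isinstance(slot, dict):
        raise ValueError("missing 'domain' or 'slot'")
    if "insType" not in slot or "results" not in slot:
        raise ValueError("slot needs 'insType' and 'results'")
    return domain, slot["insType"], slot["results"]

# The protocol string from the subtitle translation example; "***" is the
# placeholder used in the text for the translated content.
raw = '{"domain": "viewcmd", "slot": {"insType": "picture_translate", "results": "***"}}'
domain, ins_type, results = parse_target_protocol(raw)
```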

[0106] The above technical solution integrates multimodal features such as interface information, audio information, first text information, and second text information from the current interface of the vehicle terminal, and combines them with a pre-trained multimodal model to generate instructions that match the user's intent. Compared to existing technologies that can only understand text information and cannot understand image information, the above technical solution can understand both text and image information, thus expanding the application scope of the "visible and speakable" technology. This allows the technology to perform complex interactions with images, such as image object recognition, image search and recommendation, image reasoning, and image translation. Furthermore, in this embodiment, only one multimodal model is used to fuse image information, audio information, first text information, and second text information, which reduces information loss during processing, thus enabling end-to-end "visible and speakable" functionality across all scenarios.

[0107] In one optional implementation, based on the above-described generation of instructions matching user intent using a multimodal model, the multimodal model may include an audio processing model, an image processing model, and a generative language model. In the process of generating a response instruction matching user intent based on audio information, interface information, first text information, and second text information, the audio information can be input into the audio processing model to obtain audio feature information corresponding to the audio information; the interface information can be input into the image processing model to obtain interface feature information corresponding to the interface information; and the audio feature information, interface feature information, first text information, and second text information can be input into the generative language model to generate a response instruction matching user intent.

[0108] The interface feature information may include the displayed text information on the current interface of the vehicle terminal and the descriptive text information for each operable control on the current interface of the vehicle terminal. Displayed text information refers to the text shown on the current interface of the vehicle terminal, such as song titles and video titles. Descriptive text information refers to the text describing each operable control on the current interface of the vehicle terminal. For example, as shown in Figure 4, the description text of control 401 is "back control", the description text of control 402 is "pause / play control", and the description text of control 403 is "forward control".

[0109] In this embodiment, the audio processing model can be any pre-trained Automatic Speech Recognition (ASR) model, such as the Wav2vec model, the HuBERT model, the data2vec model, etc. This embodiment does not specifically limit the audio processing model.

[0110] In this embodiment, the image processing model can be any pre-trained image model, such as the AlexNet model, the ResNet-50 model, etc. This embodiment does not specifically limit the image processing model.

[0111] Specifically, in some embodiments, the server can input audio information into an audio processing model to obtain an initial audio vector corresponding to the audio information, and then convert the initial audio vector into an audio feature vector through a pre-set audio vector alignment layer. Similarly, the server can input the interface information of the current interface of the vehicle terminal into an image processing model to obtain an initial interface vector corresponding to the interface information, and then convert the initial interface vector into an interface feature vector through a pre-set image vector alignment layer.

[0112] After obtaining the audio feature vector and the interface feature vector, the audio feature vector, the interface feature vector, the first text information, and the second text information can be input into the generative language model. The generative language model can sequentially determine the first text feature vector corresponding to the first text information and the second text feature vector corresponding to the second text information, and perform fusion processing on the audio feature vector, the interface feature vector, the first text feature vector, and the second text feature vector to output a target protocol in a preset format. After obtaining the target protocol, the server can send the target protocol to the vehicle terminal, and the vehicle terminal can execute the corresponding instructions according to the target protocol.

[0113] The audio initial vector and audio feature vector both contain audio feature information; the interface initial vector and interface feature vector both contain interface feature information.
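To make the role of the alignment layers in [0111] concrete, the sketch below maps an initial audio vector and an initial interface vector of different sizes into one shared feature space. The source does not describe the internal form of the alignment layers; a single linear projection y = Wx + b is assumed here, and all sizes, weights, and names are hypothetical.

```python
def linear_align(initial_vector, weights, bias):
    """Project an initial vector into the shared feature space: y = W x + b."""
    if any(len(row) != len(initial_vector) for row in weights):
        raise ValueError("each weight row must match the input length")
    if len(bias) != len(weights):
        raise ValueError("bias length must match the number of weight rows")
    return [sum(w * x for w, x in zip(row, initial_vector)) + b
            for row, b in zip(weights, bias)]

# Hypothetical sizes: a 3-dimensional audio vector and a 4-dimensional
# interface vector are both mapped to a 2-dimensional shared space.
audio_initial = [1.0, 2.0, 3.0]
audio_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
audio_bias = [0.0, 0.5]

interface_initial = [1.0, 0.0, 2.0, 0.0]
interface_weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 1.0]]
interface_bias = [0.0, 0.0]

audio_feature = linear_align(audio_initial, audio_weights, audio_bias)
interface_feature = linear_align(interface_initial, interface_weights, interface_bias)
```

After alignment both feature vectors have the same length, which is what allows a single downstream model to consume them together with the text inputs.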

[0114] Through the above technical solution, a generative language model can be used to fuse and understand audio feature vectors, interface feature vectors, first text feature vectors, and second text feature vectors to generate instructions that match the user's intent. Compared to existing technologies where users can only use customized text for voice interaction, the generative language model used in this application can fuse and understand information from multiple input modalities, allowing for voice interaction using any form of text and achieving automatic generalization of text extensions. This eliminates the need for manual adaptation of the voice interaction text, thus reducing the workload of manual adaptation and effectively improving voice interaction efficiency.

[0115] In an optional implementation, after converting the audio information into first text information in S101, the method further includes: if the first text information satisfies the second condition, inputting the audio information, the first text information, and the interface information of the current interface of the vehicle terminal into a pre-trained multimodal model to obtain an operation instruction that matches the user's intent.

[0116] The second condition is used to identify the first text information as control information; the operation instruction is used to instruct the corresponding control operation to be performed on the target controlled object in the interface information.

[0117] In this embodiment of the application, the operation instruction can be a click operation on a target control, such as a click on control 402 shown in Figure 4, or an operation performed on a target object, such as opening a car window.

[0118] In some embodiments, after receiving audio information, the server can convert the audio information into text information to obtain first text information, and determine multiple keywords contained in the first text information. If the multiple keywords do not contain pre-set question-and-answer keywords and query keywords, the first text information is determined to be control information. The server can input the audio information, the first text information, and the interface information of the current interface of the vehicle terminal into a pre-trained multimodal model to obtain instructions for the target controlled object (i.e., operation instructions matching the user's intent).
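The keyword test in [0118] amounts to a simple routing decision. The sketch below shows one way to express it; the keyword sets are hypothetical, since the source says only that question-and-answer keywords and query keywords are preset, and the labels returned by `classify_first_text` are names chosen for this example.

```python
import re

# Hypothetical keyword lists; the source does not enumerate them.
QA_KEYWORDS = {"what", "why", "how", "who", "when", "where"}
QUERY_KEYWORDS = {"search", "query", "find", "look"}

def extract_keywords(first_text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", first_text.lower())

def classify_first_text(first_text):
    """Return 'qa_or_query' (first condition) or 'control' (second condition)."""
    keywords = set(extract_keywords(first_text))
    if keywords & (QA_KEYWORDS | QUERY_KEYWORDS):
        return "qa_or_query"
    return "control"
```

For instance, `classify_first_text("play")` returns `"control"`, so the text database lookup is skipped and only the audio, the first text, and the interface information go to the multimodal model.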

[0119] Specifically, the multimodal model can include an audio processing model, an image processing model, and a generative language model. In the process of obtaining instructions matching the user's intent based on the audio information, the first text information, and the interface information of the current interface of the vehicle terminal, the audio information can be input into the audio processing model to obtain the corresponding audio feature information. Similarly, the interface information of the current interface of the vehicle terminal can be input into the image processing model to obtain the corresponding interface feature information.

[0120] After obtaining the audio feature information and interface feature information, the server can input the audio feature information, the first text information, and the interface feature information into the generative language model to determine the target controlled object among the multiple controlled objects included in the current interface of the vehicle terminal, and obtain the operation instruction for the target controlled object.

[0121] For example, in one embodiment, assume the current interface of the vehicle terminal is a paused video playback interface. As shown in Figure 4, the controlled objects included in this interface are: a back control 401, a pause / play control 402, and a forward control 403. The user inputs the audio information "play". After receiving this audio information, the server can convert it into text information to obtain the first text information. If the first text information does not contain question-and-answer keywords or query keywords, it can be determined that the first text information is control information.

[0122] After determining that the first text information is control information, the server can input the audio information into the audio processing model to obtain the corresponding audio feature information, and input the current interface of the vehicle terminal shown in Figure 4 (the paused video playback interface) into the image processing model to obtain the interface feature information corresponding to the interface information. The interface feature information may include the description text information of control 401 "back control", the description text information of control 402 "pause / play control" and the description text information of control 403 "forward control".

[0123] After obtaining the first text information, audio feature information, and interface feature information corresponding to the audio information, the server can input these three pieces of information into the generative language model. Based on the first text information and audio feature information, the server can match the target controlled object, "Pause / Play Control 402," from the interface feature information and determine the operation instruction for the target controlled object. This operation instruction is used to instruct the vehicle terminal to perform a click operation on the target controlled object, "Pause / Play Control 402."
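The matching step in the Figure 4 example can be illustrated with a deliberately simple stand-in. In the application the match is made by the generative language model; the sketch below replaces it with word overlap between the first text information and each control's description text, purely to show the input and output of the step. The `operation_instruction` dictionary layout is hypothetical.

```python
import re

def tokens(text):
    """Lowercase word tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def match_target_control(first_text, controls):
    """Pick the control whose description shares the most words with the text."""
    spoken = tokens(first_text)
    best_id, best_overlap = None, 0
    for control_id, description in controls.items():
        overlap = len(spoken & tokens(description))
        if overlap > best_overlap:
            best_id, best_overlap = control_id, overlap
    return best_id

# Description texts of the controls in the Figure 4 example.
controls = {
    401: "back control",
    402: "pause / play control",
    403: "forward control",
}
target = match_target_control("play", controls)
operation_instruction = {"action": "click", "target": target}
```

When no description shares a word with the utterance, `match_target_control` returns `None`, which a caller could treat as "no on-screen target".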

[0124] In another embodiment, assuming the current interface of the vehicle terminal is a song selection interface, and this song selection interface contains the names of multiple songs, such as "**Chapter Seven**", "**The Moon** Rises**", "Reverse**", "**Arrives**", etc., the user inputs the audio information "Chapter Seven". After receiving the audio information, the server can convert the audio information into text information to obtain the first text information. The first text information does not contain question-and-answer keywords or query keywords, so the first text information can be identified as control information.

[0125] After determining that the first text information is control information, the server can input the audio information into the audio processing model to obtain the corresponding audio feature information. The interface information of the current interface (song selection interface) of the vehicle terminal is then input into the image processing model to obtain the corresponding interface feature information. This interface feature information may include the display text information of each song title, such as "**Chapter Seven**", "**The Moon** Rises**", "Reverse**", "**Arrives**".

[0126] After obtaining the first text information, audio feature information, and interface feature information corresponding to the audio information, the server can input these three pieces of information into the generative language model. Based on the first text information and audio feature information, the server can match the target controlled object "**Chapter Seven" from the interface feature information and determine the operation instruction for the target controlled object. This operation instruction is used to instruct the vehicle terminal to perform a click operation on the target controlled object "**Chapter Seven".

[0127] The above technical solution allows for the generation of instructions to perform the operation corresponding to the keyword on the target controlled object, provided that the first text information contains only one keyword, and that keyword is a verb. This is achieved by combining audio information, the first text information, and the interface information of the current interface of the vehicle terminal (including information on multiple controlled objects). Therefore, the target controlled object can be directly matched from the interface information of the current interface, enabling the vehicle terminal to directly perform the operation corresponding to the keyword on the target controlled object. This allows for rapid understanding of the driver's operational intentions and effectively improves the efficiency of voice interaction.

[0128] Figure 5 is a structural diagram illustrating a voice instruction interaction method according to an exemplary embodiment. As shown in Figure 5, after receiving audio information, the server can input the audio information into the audio processing model 503 to obtain the initial audio vector, and then input the initial audio vector into the audio vector alignment layer 504 to obtain the audio feature vector. The interface information of the current interface of the vehicle terminal is input into the image processing model 501 to obtain the initial interface vector, and the initial interface vector is then input into the image vector alignment layer 502 to obtain the interface feature vector.

[0129] The audio information is converted into text information to obtain the first text information. After determining that the first text information is question-and-answer information or query information, the first text vector corresponding to the first text information can be determined. Then, one or more text vectors that match the first text vector are searched for in the vector database to obtain the second text vector. After obtaining the second text vector, the server can obtain the second text information corresponding to the second text vector from the text database.

[0130] After obtaining the interface feature vector, audio feature vector, first text information, and second text information through the above method, the server can input them into the generative language model 505 to obtain the target protocol in a preset format.

[0131] In one alternative implementation, the user intent is the user's query intent; the first text information includes one or more keywords; the content included in the response instruction includes information related to one or more keywords, that is, the response instruction is used to instruct the vehicle terminal to play and / or display information related to one or more keywords.

[0132] For example, in one embodiment, assuming the user's intent is to "query the vehicle infotainment system status of a vehicle of brand XX and model XX", the first text information may include the keywords "query", "brand XX", "model XX", "vehicle", and "vehicle infotainment system status". Correspondingly, the response instruction includes information related to the keywords "query", "brand XX", "model XX", "vehicle", and "vehicle infotainment system status". The response instruction is used to instruct the in-vehicle terminal to play and / or display information related to the keywords "query", "brand XX", "model XX", "vehicle", and "vehicle infotainment system status".

[0133] In one optional implementation, the user intent is the user's vehicle control intent; the first text information includes one or more keywords; the operation instruction includes one or more functional events corresponding to the keywords, that is, the operation instruction is used to instruct the vehicle terminal to execute one or more functional events corresponding to the keywords.

[0134] For example, in one embodiment, assuming the user's vehicle control intention is "open the window", the first text information may include the keywords "open" and "window". Accordingly, the operation instruction is used to instruct the vehicle terminal to perform the function event corresponding to the keywords "open" and "window", that is, to open the window.
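The mapping from keywords to functional events in [0133] and [0134] can be pictured as a lookup table. The registry below is hypothetical: the source gives only the "open" and "window" example, and the event names such as `window.open` are invented for illustration.

```python
# Hypothetical registry from (action keyword, device keyword) to a functional event.
FUNCTION_EVENTS = {
    ("open", "window"): "window.open",
    ("close", "window"): "window.close",
    ("open", "sunroof"): "sunroof.open",
}

def resolve_function_events(keywords):
    """Return every registered event whose action and device both appear."""
    present = set(keywords)
    return [event for (action, device), event in FUNCTION_EVENTS.items()
            if action in present and device in present]

# Keywords from the "open the window" example.
events = resolve_function_events(["open", "window"])
```

Because the function returns a list, one utterance can resolve to several functional events, matching the "one or more functional events" wording above.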

[0135] The above technical solutions can transform the "See and Speak" feature from text recognition on the screen to multimodal recognition. By combining generative language models and text databases, the accuracy of image and speech understanding is greatly improved, thereby expanding the "See and Speak" scenario from a single software command scenario to a full-domain chat and control scenario.

[0136] The "see-and-say" functionality, which integrates information from multiple modalities, represents a qualitative leap in higher-order intellectual behaviors, including but not limited to reasoning, planning, abstract thinking, self-reflection and error correction, and understanding complex intentions. Furthermore, due to the embedded knowledge base, it can integrate more precise information, avoiding the generation of incorrect answers. Users can interact freely with the human-computer interaction screen, for example:

[0137] 1. Imagine a user is playing a complex role-playing game. Using the "see it, say it" feature, the user can easily find character information, learn about in-game quests and items, and plan their next move.

[0138] 2. Suppose the user is a parent who wants to help their elementary school student learn English. Using the "See and Speak" feature, the user can quickly find interesting English learning resources, such as videos, images, and games. Relevant videos can be played, and an interactive question-and-answer mode can help the child reinforce their knowledge.

[0139] 3. Imagine a user is watching a movie or TV series but is confused by the descriptions of certain characters or scenes. Using the "See and Say" feature, the user can find relevant information to gain a deeper understanding of the characters, plot, and background of the film or TV series, and thus better comprehend and appreciate the work.

[0140] 4. Imagine a user is an entrepreneur who needs to make important business decisions. Using the "See and Say" feature, the user can find relevant data, reports, and case studies to summarize market trends, competitors, and customer needs. The user can also plan and forecast to make more informed business decisions.

[0141] Before the image processing model, audio processing model, and generative language model in the multimodal model described above are used, these models can be trained separately. The training process for each of the three models is described below.

[0142] Figure 6 is a flowchart illustrating a training method for an audio processing model according to an exemplary embodiment. As shown in Figure 6, the training method includes the following steps:

[0143] S601, Obtain the initial audio-text pair dataset, and preprocess the audio data in the initial audio-text pair dataset to obtain the training dataset.

[0144] Specifically, technicians can first collect a large amount of audio information and the corresponding text information to form an initial audio-text pair dataset, and then send the initial audio-text pair dataset to the server through the user terminal. After receiving the initial audio-text pair dataset sent by the terminal, the server can preprocess the audio data in the initial audio-text pair dataset to obtain the training dataset.

[0145] In this embodiment, the preprocessing operation may include, but is not limited to, noise reduction, segmentation, alignment, etc. This embodiment does not limit the preprocessing operation.

[0146] S602, Use a preset feature extraction algorithm to extract audio feature vectors and corresponding text feature vectors from the training dataset.

[0147] Specifically, after obtaining the training dataset through S601, the server can extract audio feature vectors representing acoustic features and text feature vectors corresponding to each audio feature vector from the training dataset.

[0148] In this application embodiment, the preset feature extraction algorithm may include, but is not limited to, the Mel-Frequency Cepstral Coefficients (MFCC) algorithm and the Filter Banks (FBANK) algorithm. This application embodiment does not limit the feature extraction algorithm.

[0149] S603, Use the audio feature vectors and corresponding text feature vectors to train and optimize the initial audio processing model to obtain the trained initial audio processing model.

[0150] In this embodiment, the initial audio processing model can be any end-to-end neural network model, such as the DeepSpeech model, the CTC model, etc. This embodiment does not limit the initial audio processing model.

[0151] Specifically, the initial audio processing model can be trained using audio feature vectors and their corresponding text feature vectors, and the model parameters can be continuously optimized to obtain the trained initial audio processing model.

[0152] S604, Obtain the test dataset and use it to test and evaluate the trained initial audio processing model to obtain the audio processing model.

[0153] Specifically, the test dataset can be input into the trained initial audio processing model to obtain the output result, and the output result can be compared with the standard result to evaluate the trained initial audio processing model. Based on the evaluation result, the parameters of the trained initial audio processing model can be adjusted to improve the performance and accuracy of the trained initial audio processing model, and the final audio processing model can be obtained.
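One way to compare output results with standard results in S604 is word error rate (WER). The source does not name an evaluation metric, so WER is an assumption chosen here because it is a common measure for speech recognition; the test pairs are invented examples.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def word_error_rate(reference_text, hypothesis_text):
    """Word-level edit distance divided by the number of reference words."""
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference, hypothesis) / len(reference)

# Hypothetical (standard result, model output) pairs from a test dataset.
test_pairs = [
    ("open the window", "open the window"),
    ("play chapter seven", "play chapter eleven"),
]
rates = [word_error_rate(ref, hyp) for ref, hyp in test_pairs]
```

An average of such rates over the test dataset gives a single score that can guide the parameter adjustment described in [0153].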

[0154] After training the audio processing model using the above technical solution, the audio processing model can learn the correspondence between audio data and text, as well as the correspondence between audio feature vectors and text feature vectors. In this way, it can obtain the standard text information corresponding to the audio data and the text feature vector in the text space during use.

[0155] Figure 7 is a flowchart illustrating a training method for an image processing model according to an exemplary embodiment. As shown in Figure 7, the training method includes the following steps:

[0156] S701, Obtain the initial image-text pair dataset, and preprocess the image data in the initial image-text pair dataset to obtain the training dataset.

[0157] Specifically, technicians can first collect a large amount of image information and the corresponding text information to form an initial image-text pair dataset, and then send the initial image-text pair dataset to the server through the user terminal.

[0158] In this embodiment of the application, each piece of image information may correspond to one or more pieces of text information. For example, the text information corresponding to the image information may include image display information, image translation information, image reasoning information, etc. This embodiment of the application does not limit the number of pieces of text information corresponding to the image information.

[0159] S702, Use an image encoder to extract image feature vectors and corresponding text feature vectors from the training dataset.

[0160] Specifically, after obtaining the training dataset through S701, the server can extract image feature vectors through the image encoder and extract the text feature vectors corresponding to each image feature vector.

[0161] S703, Use the image feature vectors and corresponding text feature vectors to train and optimize the initial image processing model to obtain the trained initial image processing model.

[0162] Specifically, the initial image processing model can be trained using image feature vectors and corresponding text feature vectors, and the model parameters can be continuously optimized to obtain the trained initial image processing model.

[0163] S704. Obtain the test dataset and use it to test and evaluate the trained initial image processing model to obtain the image processing model.

[0164] Specifically, the test dataset can be input into the trained initial image processing model to obtain the output result, and the output result can be compared with the standard result to evaluate the trained initial image processing model. Based on the evaluation result, the parameters of the trained initial image processing model can be adjusted to improve the performance and accuracy of the trained initial image processing model, and the final image processing model can be obtained.
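The evaluation in S704 can be illustrated with a minimal sketch. The embodiment does not name an evaluation metric, so exact-match accuracy is used here purely as an assumption; the function name `evaluate` and the toy model are hypothetical and stand in for the trained initial image processing model.

```python
from typing import Callable, Iterable, Tuple


def evaluate(model: Callable[[str], str],
             test_set: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of test samples whose output equals the standard result.

    `model` is a hypothetical stand-in for the trained initial image
    processing model; each test sample is (input, standard_result).
    """
    pairs = list(test_set)
    if not pairs:
        raise ValueError("test set is empty")
    correct = sum(1 for sample, expected in pairs if model(sample) == expected)
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy model: upper-cases its input (illustration only).
    accuracy = evaluate(lambda s: s.upper(), [("a", "A"), ("b", "B"), ("c", "x")])
    print(f"accuracy = {accuracy:.2f}")
```

A score like this would then guide the parameter adjustment described in the paragraph above; any threshold for accepting the model is left to the implementer.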

[0165] During the training of a generative language model, an initial generative language model can be obtained first, and the output format and output semantics of the initial generative language model can be adjusted through a preset model adjustment method to obtain the adjusted generative language model.

[0166] In this embodiment, the preset model adjustment method can be P-Tuning, LoRA, Prompt-Tuning, or Prefix-Tuning. This embodiment does not limit the model adjustment method.

[0167] In this embodiment of the application, the output format and output semantics of the generative language model can be set according to the actual situation. For example, the output format of the generative language model can be set to JSON format, and the domain concept and strong assignment intent can be set in the output semantics of the generative language model.

[0168] For example, in one embodiment, assuming the input to the generative language model is "open the car window", the output of the unadjusted generative language model (i.e., the initial generative language model) could be "Sorry, as a language AI assistant, I cannot directly open the car window. However, please note that while driving, you must abide by traffic rules and safety guidelines. Never open the car door or window while driving." The output of the adjusted generative language model could be "{"domain": "aircontrol", "intent": ["action.open", "device.window"]}". Here, "domain" represents the domain, "domain": "aircontrol" indicates that the domain is the control domain, "intent" represents the intention, "device" represents the device, and "intent": ["action.open", "device.window"] means opening the window.
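As an illustration of how a vehicle terminal might consume the adjusted model's JSON output, the following minimal sketch parses the example string from the paragraph above. The function name `parse_target_protocol` and the flattened result layout are assumptions made for illustration; the embodiment only specifies the "domain" and "intent" fields.

```python
import json
from typing import Dict, Optional


def parse_target_protocol(raw: str) -> Dict[str, Optional[str]]:
    """Parse the model's JSON output into domain, action, and device.

    Each intent entry is assumed to have the form "<key>.<value>",
    as in "action.open" and "device.window".
    """
    data = json.loads(raw)
    domain = data["domain"]
    intent = data["intent"]
    if not isinstance(domain, str) or not isinstance(intent, list):
        raise ValueError("unexpected protocol shape")
    slots: Dict[str, str] = {}
    for item in intent:
        key, _, value = item.partition(".")
        slots[key] = value
    return {
        "domain": domain,
        "action": slots.get("action"),
        "device": slots.get("device"),
    }


if __name__ == "__main__":
    raw = '{"domain": "aircontrol", "intent": ["action.open", "device.window"]}'
    print(parse_target_protocol(raw))
```

Because the unadjusted model returns free-form prose, `json.loads` would raise an error on its output, which is one practical reason for constraining the output format.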

[0169] After training an image processing model that can convert image information into initial image vectors, an audio processing model that can convert audio information into initial audio vectors, and a generative language model that can fuse information from multiple modalities to obtain a target protocol recognizable by the vehicle terminal, an image vector alignment layer can be added after the image processing model, and an audio vector alignment layer can be added after the audio processing model. These layers convert the initial image vectors into image feature vectors that the generative language model can understand, and the initial audio vectors into audio feature vectors that the generative language model can understand. The generative language model can then directly fuse the image and audio feature vectors it receives as input, thereby generating the corresponding target protocol.
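The embodiment does not specify the internal form of the alignment layers. A minimal sketch, assuming each alignment layer is a simple linear projection into a shared dimension, is shown below; the function name `align`, the weights, and the dimensions are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def align(vec: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Project an initial modality vector into the language model's input space.

    Computes weight @ vec + bias, with one row of `weight` per output dimension.
    """
    if len(weight) != len(bias) or any(len(row) != len(vec) for row in weight):
        raise ValueError("shape mismatch between vector, weight, and bias")
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


if __name__ == "__main__":
    # A 3-dimensional initial image vector and a 4-dimensional initial audio
    # vector are both mapped to the same 2-dimensional space (toy sizes).
    image_feature = align([1.0, 2.0, 3.0],
                          [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
                          [0.5, -1.0])
    audio_feature = align([1.0, 1.0, 1.0, 1.0],
                          [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]],
                          [0.0, 0.0])
    # Once aligned, both vectors have the same width and can be fed together.
    fused_input = [image_feature, audio_feature]
    print(fused_input)
```

In practice the projection weights would be learned during training; the point of the sketch is only that both modalities end up with the same dimensionality before they reach the generative language model.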

[0170] The foregoing mainly describes the solutions provided by the embodiments of this application from a methodological perspective. To achieve the above functions, the voice command interaction device or electronic device includes hardware structures and / or software modules corresponding to the execution of each function. Those skilled in the art should readily recognize that, based on the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0171] This application embodiment can, based on the above method, exemplarily divide a voice command interaction device or electronic device into functional modules. For example, the voice command interaction device or electronic device may include functional modules corresponding to each functional division, or two or more functions may be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. It should be noted that the module division in this application embodiment is illustrative and only represents one logical functional division; in actual implementation, there may be other division methods.

[0172] Figure 8 is a block diagram illustrating a voice command interaction device 800 according to an exemplary embodiment. Referring to Figure 8, the voice command interaction device 800 includes:

[0173] The conversion unit 801 is used to receive audio information and convert the audio information into first text information; the audio information is used to reflect the user's intent.

[0174] The acquisition unit 802 is used to acquire second text information matching the first text information from the database when the first text information satisfies the first condition; the first condition is used to identify whether the first text information is question-and-answer information or query information.

[0175] The generation unit 803 is used to generate a response instruction that matches the user's intent based on the audio information, the first text information, the second text information, and the interface information of the current interface of the vehicle terminal.

[0176] In one possible implementation, the device further includes a processing unit, which is specifically used for:

[0177] If the first text information satisfies the second condition, the audio information, the first text information, and the interface information of the current interface of the vehicle terminal are input into the pre-trained multimodal model to obtain the operation instructions that match the user's intention; the second condition is used to identify the first text information as control information; the operation instructions are used to instruct the corresponding control operation to be performed on the target controlled object in the interface information.
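The branching between the first condition (question-and-answer or query information) and the second condition (control information) can be sketched as follows. This section does not describe how the conditions are evaluated, so the keyword heuristic in `classify_first_text` is purely an assumption; `route`, `retrieve`, and `model` are hypothetical names standing in for the units and the pre-trained multimodal model.

```python
from typing import Callable, Optional

# Hypothetical keyword lists used only for this illustration.
QUESTION_WORDS = ("what", "how", "why", "where", "when", "who")
CONTROL_VERBS = ("open", "close", "turn on", "turn off", "play", "set")


def classify_first_text(first_text: str) -> str:
    """Return "query", "control", or "unknown" for the first text information."""
    text = first_text.strip().lower()
    if text.endswith("?") or text.startswith(QUESTION_WORDS):
        return "query"      # first condition satisfied
    if text.startswith(CONTROL_VERBS):
        return "control"    # second condition satisfied
    return "unknown"


def route(audio: bytes,
          first_text: str,
          interface_info: bytes,
          retrieve: Callable[[str], str],
          model: Callable[[bytes, bytes, str, Optional[str]], str]) -> str:
    """Dispatch to the response-instruction or operation-instruction path."""
    kind = classify_first_text(first_text)
    if kind == "query":
        # Query path: fetch second text information, then generate a response.
        second_text = retrieve(first_text)
        return model(audio, interface_info, first_text, second_text)
    if kind == "control":
        # Control path: no second text information is retrieved.
        return model(audio, interface_info, first_text, None)
    raise ValueError("first text information satisfies neither condition")
```

A production system would more plausibly use a trained classifier than keyword matching; the sketch only shows that the two conditions select between two input sets for the same multimodal model.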

[0178] In one possible implementation, the generation unit 803 is specifically used for:

[0179] Audio information, interface information, first text information, and second text information are input into a pre-trained multimodal model to generate response instructions that match the user's intent.

[0180] In one possible implementation, the multimodal model includes an image processing model, an audio processing model, and a generative language model; the generation unit 803 is specifically used for:

[0181] The audio information is input into the audio processing model to obtain the audio feature information corresponding to the audio information; the interface information is input into the image processing model to obtain the interface feature information corresponding to the interface information; the audio feature information, interface feature information, first text information and second text information are input into the generative language model to generate a response instruction that matches the user's intent.
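The data flow through the three sub-models can be sketched as a simple composition. The class name `MultimodalModel`, the callable signatures, and the use of `bytes` for the interface information are assumptions for illustration; the stubs in the usage example stand in for real trained models.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class MultimodalModel:
    """Hypothetical wrapper composing the three sub-models named in the text."""
    audio_model: Callable[[bytes], Vector]
    image_model: Callable[[bytes], Vector]
    language_model: Callable[[Vector, Vector, str, str], str]

    def generate(self, audio: bytes, interface_info: bytes,
                 first_text: str, second_text: str) -> str:
        # Step 1: audio information -> audio feature information.
        audio_features = self.audio_model(audio)
        # Step 2: interface information -> interface feature information.
        interface_features = self.image_model(interface_info)
        # Step 3: fuse both feature sets with the two texts in the language model.
        return self.language_model(audio_features, interface_features,
                                   first_text, second_text)


if __name__ == "__main__":
    # Stub sub-models (illustration only).
    model = MultimodalModel(
        audio_model=lambda a: [float(len(a))],
        image_model=lambda i: [float(len(i))],
        language_model=lambda af, vf, t1, t2: f"{t1} | {t2} | {af} | {vf}",
    )
    print(model.generate(b"\x00\x01", b"\x00", "what is this icon", "manual entry"))
```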

[0182] In one possible implementation, the database includes a vector database and a text database; the acquisition unit 802 is specifically used for:

[0183] Determine the first text vector corresponding to the first text information; search for the second text vector that matches the first text vector from the vector database; obtain the second text information corresponding to the second text vector from the text database.
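The two-stage lookup (vector database, then text database) can be sketched as below. The embodiment does not name a matching criterion, so cosine similarity is used here as an assumption, and both databases are modeled as in-memory dictionaries sharing hypothetical record IDs.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_second_text(first_vector: Vector,
                         vector_db: Dict[str, Vector],
                         text_db: Dict[str, str]) -> str:
    """Find the best-matching second text vector, then look up its text."""
    if not vector_db:
        raise ValueError("vector database is empty")
    # Stage 1: search the vector database for the closest second text vector.
    best_id = max(vector_db, key=lambda k: cosine(first_vector, vector_db[k]))
    # Stage 2: fetch the corresponding second text information by ID.
    return text_db[best_id]


if __name__ == "__main__":
    vector_db = {"doc-1": [1.0, 0.0], "doc-2": [0.0, 1.0]}
    text_db = {"doc-1": "Tire pressure guidance", "doc-2": "Bluetooth pairing steps"}
    print(retrieve_second_text([0.9, 0.1], vector_db, text_db))
```

Keeping vectors and texts in separate stores, as the text describes, lets the similarity search run over compact vectors while the full text is fetched only for the matched entry.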

[0184] In one possible implementation, the user intent is the user's query intent; the first text information includes one or more keywords; the response instruction includes information related to one or more keywords.

[0185] In one possible implementation, the user intent is the user's vehicle control intent; the first text information includes one or more keywords; the operation instruction includes one or more function events corresponding to the keywords.

[0186] The voice command interaction device provided in this application embodiment can, after receiving audio information, convert the audio information into first text information. If the first text information is of a question-and-answer type or query type, the device can combine the audio information, the first text information, the second text information that matches the first text information, and the interface information of the current interface of the vehicle terminal to generate a response instruction that matches the user's intent. This can effectively improve the generalization ability of voice interaction and accurately understand the driver's operation intent.

[0187] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0188] Figure 9 is a block diagram illustrating an electronic device according to an exemplary embodiment. As shown in Figure 9, the electronic device 900 includes, but is not limited to, a processor 901 and a memory 902.

[0189] The memory 902 described above is used to store the executable instructions of the processor 901. It is understood that the processor 901 is configured to execute instructions to implement the voice command interaction method in the above embodiments.

[0190] It should be noted that those skilled in the art will understand that the electronic device structure shown in Figure 9 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown in Figure 9, a combination of certain components, or a different arrangement of components.

[0191] Processor 901 is the control center of the electronic device. It connects various parts of the electronic device via various interfaces and lines. By running or executing software programs and / or modules stored in memory 902, and by calling data stored in memory 902, it performs various functions and processes data, thereby providing overall monitoring of the electronic device. Processor 901 may include one or more processing units. Optionally, processor 901 may integrate an application processor and a modem processor. The application processor mainly handles the operating system, user interface, and applications, while the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into processor 901.

[0192] The memory 902 can be used to store software programs and various data. The memory 902 may primarily include a program storage area and a data storage area. The program storage area may store the operating system, application programs required by at least one functional module (such as a determination unit, processing unit, etc.), etc. Furthermore, the memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device.

[0193] In an exemplary embodiment, a computer-readable storage medium including instructions is also provided, such as a memory 902 including instructions, which can be executed by a processor 901 of an electronic device 900 to implement the methods in the above embodiments.

[0194] In actual implementation, the conversion unit 801, acquisition unit 802, and generation unit 803 shown in Figure 8 can all be implemented by the processor 901 shown in Figure 9 calling the computer program stored in the memory 902. For the specific execution process, refer to the description of the method in the preceding embodiments, which will not be repeated here.

[0195] Optionally, the computer-readable storage medium may be a non-transitory computer-readable storage medium, such as a read-only memory (ROM), random access memory (RAM), CD-ROM, magnetic tape, floppy disk, or optical data storage device.

[0196] In an exemplary embodiment, this application also provides a computer program product including one or more instructions, which can be executed by a processor 901 of an electronic device to perform the methods described above.

[0197] It should be noted that when one or more instructions in the computer-readable storage medium or computer program product are executed by the processor of an electronic device, they implement the various processes of the above method embodiments and achieve the same technical effect as the above method. To avoid repetition, they will not be described again here.

[0198] Through the above description of the embodiments, those skilled in the art can clearly understand that, for the sake of convenience and brevity, only the division of the above functional modules is used as an example. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

[0199] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another apparatus, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0200] The units described as separate components may or may not be physically separate. A component shown as a unit can be one or more physical units; that is, it can be located in one place or distributed in multiple different locations. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0201] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0202] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This software product is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, RAM, magnetic disks, or optical disks.

[0203] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A voice command interaction method, characterized in that the method includes: receiving audio information and converting the audio information into first text information, the audio information being used to reflect the user's intent; if the first text information satisfies a first condition, retrieving second text information matching the first text information from a database, the first condition being used to identify whether the first text information is question-and-answer information or query information; and inputting the audio information, the interface information of the current interface of the vehicle terminal, the first text information, and the second text information into a pre-trained multimodal model to generate a response instruction matching the user's intent; wherein the multimodal model includes an image processing model, an audio processing model, and a generative language model, and generating the response instruction matching the user's intent includes: inputting the audio information into the audio processing model to obtain the audio feature information corresponding to the audio information; inputting the interface information into the image processing model to obtain the interface feature information corresponding to the interface information; and inputting the audio feature information, the interface feature information, the first text information, and the second text information into the generative language model to generate the response instruction that matches the user's intent.

2. The method according to claim 1, characterized in that, The method further includes: When the first text information satisfies the second condition, the audio information, the first text information, and the interface information of the current interface of the vehicle terminal are input into a pre-trained multimodal model to obtain an operation instruction that matches the user's intent; the second condition is used to identify the first text information as control information; the operation instruction is used to instruct the corresponding control operation to be performed on the target controlled object in the interface information.

3. The method according to claim 1, characterized in that the database includes a vector database and a text database; the step of retrieving the second text information matching the first text information from the database includes: determining the first text vector corresponding to the first text information; searching the vector database for a second text vector that matches the first text vector; and obtaining the second text information corresponding to the second text vector from the text database.

4. The method according to any one of claims 1-3, characterized in that the user intent is the user's query intent; the first text information includes one or more keywords; the content included in the response instruction includes information related to the one or more keywords.

5. The method according to claim 2, characterized in that, The user intent is the user's vehicle control intent; the first text information includes one or more keywords; the operation instruction includes the function event corresponding to the one or more keywords.

6. A voice command interaction device, characterized in that, The device includes: a conversion unit, an acquisition unit, and a generation unit, wherein: The conversion unit is used to receive audio information and convert the audio information into first text information; the audio information is used to reflect the user's intent. The acquisition unit is used to acquire second text information matching the first text information from the database when the first text information satisfies a first condition; the first condition is used to identify whether the first text information is question-and-answer information or query information. The generation unit is used to input the audio information, the interface information of the current interface of the vehicle terminal, the first text information, and the second text information into a pre-trained multimodal model to generate a response instruction matching the user's intent. The multimodal model includes an image processing model, an audio processing model, and a generative language model. Generating a response instruction matching the user's intent includes: inputting the audio information into the audio processing model to obtain audio feature information corresponding to the audio information; inputting the interface information into the image processing model to obtain interface feature information corresponding to the interface information; and inputting the audio feature information, the interface feature information, the first text information, and the second text information into the generative language model to generate a response instruction matching the user's intent.

7. The apparatus according to claim 6, characterized in that, The device further includes a processing unit, which is specifically used for: When the first text information satisfies the second condition, the audio information, the first text information, and the interface information of the current interface of the vehicle terminal are input into a pre-trained multimodal model to obtain an operation instruction that matches the user's intent; the second condition is used to identify the first text information as control information; the operation instruction is used to instruct the corresponding control operation to be performed on the target controlled object in the interface information.

8. An electronic device, characterized in that it includes: a processor; and a memory used to store the processor's executable instructions; wherein the processor is configured to execute the instructions to implement the voice command interaction method as described in any one of claims 1-5.

9. A computer-readable storage medium, characterized in that when the computer-executable instructions stored in the computer-readable storage medium are executed by the processor of an electronic device, the electronic device is capable of executing the voice command interaction method as described in any one of claims 1-5.