A vehicle-mounted voice personalized interaction method, device, equipment and medium

Generating personalized voice parameters from multimodal identity recognition and historical records addresses the limited interaction modes of in-vehicle voice interaction systems, improving user experience and interaction efficiency, particularly for elderly users and children.

CN122245326A · Pending · Publication Date: 2026-06-19 · VOYAH AUTOMOBILE TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: VOYAH AUTOMOBILE TECH CO LTD
Filing Date: 2026-02-25
Publication Date: 2026-06-19

Application Information

Patent Timeline

25 Feb 2026: Application
19 Jun 2026: Publication (CN122245326A)

IPC: G10L17/22; G10L17/10; G10L17/02; G10L17/04; G10L15/25; G10L15/07; G10L15/183; G10L15/22; G06F3/16; G06F9/451

AI Tagging

Application Domain

Sound input/output; Speech recognition


AI Technical Summary

Technical Problem

Existing in-vehicle voice interaction systems lack a mapping between user identity and historical behavior data. The resulting interaction is uniform and rigid, which degrades the user experience and raises the barrier to use, especially for the elderly and children.

Method used

Users are identified through multimodal identity recognition (facial image and voiceprint recognition), and personalized voice interaction parameters, including speech rate, timbre, and pitch, are generated from historical interaction records to produce customized voice output.

Benefits of technology

It improves the flexibility and user-friendliness of interaction, lowers the usage threshold for special groups, and enhances interaction efficiency and the personalization of user experience.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a method, apparatus, and device for personalized voice interaction in vehicles. The method includes: acquiring a user's identity information, wherein the identity information is obtained by recognizing facial image information and voice information; acquiring the user's historical voice interaction records based on the identity information; generating personalized voice interaction parameters that conform to the user's interaction preferences based on the identity information and the historical voice interaction records, wherein the parameters include speech rate, timbre, and pitch; and outputting an interactive voice stream based on the personalized voice interaction parameters and interaction content. By recognizing identity from facial image information and voice information and combining it with historical voice interaction records to generate personalized voice interaction parameters, this application provides a voice interaction experience tailored to each user's interaction preferences, significantly improving the naturalness, comfort, and user satisfaction of the interaction.


Description

Technical Field

[0001] This application relates to the field of in-vehicle intelligent voice interaction technology, and in particular to an in-vehicle personalized voice interaction method, device, equipment and medium.

Background Technology

[0002] In-vehicle voice interaction is the core human-machine interaction interface of the smart cockpit, and its service quality directly affects the driving experience and interaction efficiency.

[0003] Currently, most domestic vehicle manufacturers still use a generic voice response mode in their in-vehicle voice interaction systems. This results in a rigid and monotonous interaction method that can feel mechanical and unnatural to users, thus affecting the overall driving experience.

Summary of the Invention

[0004] This application provides a method, device, equipment, and medium for personalized in-vehicle voice interaction, which addresses the problem of the single and fixed nature of existing in-vehicle voice interaction methods.

[0005] The technical solution adopted in this application is as follows:

[0006] Firstly, this application provides a method for personalized voice interaction in vehicles, including: obtaining the user's identity information and retrieving the user's historical voice interaction records based on that information, wherein the identity information is obtained by recognizing the user's facial image information and voice information; generating, based on the identity information and the historical voice interaction records, personalized voice interaction parameters that match the user's interaction preferences, wherein the parameters include speech rate, timbre, and pitch; and outputting an interactive voice stream based on the personalized voice interaction parameters and interactive content, wherein the interactive content is generated based on at least one of the user's current voice data, the historical voice interaction records, and a preset system voice.

[0007] In one alternative of the first aspect, the identity information is obtained as follows: collecting the user's facial image information and voice information, and extracting the corresponding facial image features and voiceprint features; determining first identity information based on the facial image features and a preset user database, and determining second identity information based on the voiceprint features and the preset user database; and, if the first identity information matches the second identity information, treating the identification as successful and using either the first or the second identity information as the identity information.

[0008] In one alternative to the first aspect, the method further includes: If the first identity information and the second identity information are inconsistent, the identification is considered to have failed. Add the current user's identity information to the preset user database and store the current user's voice interaction records.

[0009] In one alternative to the first aspect, the method further includes: If the first identity information cannot be determined based on facial image features and the preset user database, or if the second identity information cannot be determined based on voiceprint features and the preset user database, then the recognition is considered to have failed. Add the current user's identity information to the preset user database and store the current user's voice interaction records.

[0010] In one alternative of the first aspect, generating personalized voice interaction parameters that conform to the user's interaction preferences based on the identity information and the historical voice interaction records includes: extracting, based on the identity information, the user's historical preference data from the historical voice interaction records, wherein the historical preference data includes the user's historical settings records and/or preference statistics for at least one of speech rate, timbre, and pitch; and generating the personalized voice interaction parameters based on the historical preference data.

[0011] In one alternative of the first aspect, outputting an interactive voice stream based on the personalized voice interaction parameters and the interactive content includes: obtaining the user's current voice data and identifying the user's intent; determining, based on the historical voice interaction records, the contextual information associated with the user's intent; generating the interactive content based on at least one of the user's intent, the contextual information, and a preset system voice; adjusting the configuration parameters of the speech synthesis engine according to the personalized voice interaction parameters, and synthesizing the interactive voice stream based on the interactive content; and outputting the interactive voice stream.

[0012] In one alternative of the first aspect, an interactive voice stream is output based on personalized voice interaction parameters and interactive content, including: If multiple users are identified, the identity information and seat position of each user are determined separately. User priority is determined based on identity information and seat location, and interactive voice stream is output based on user priority, personalized voice interaction parameters, and interactive content.
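The multi-user case can be sketched as follows. The source states only that priority depends on identity information and seat position; the seat ranking, the `is_owner` attribute, and all names below are hypothetical assumptions chosen for illustration, not the applicant's rule.

```python
from dataclasses import dataclass

# Assumed seat ranking (lower value = higher priority). The source does not
# specify an order; putting the driver first is an illustrative choice.
SEAT_RANK = {"driver": 0, "front_passenger": 1, "rear_left": 2, "rear_right": 2}


@dataclass
class Occupant:
    user_id: str
    seat: str
    is_owner: bool = False  # hypothetical identity attribute


def order_by_priority(occupants):
    """Return occupants sorted from highest to lowest priority.

    Seat rank is compared first, then owners come before non-owners.
    sorted() is stable, so remaining ties keep their input order.
    """
    return sorted(
        occupants,
        key=lambda o: (SEAT_RANK.get(o.seat, 99), not o.is_owner),
    )
```

A caller would respond to the first occupant in the returned list using that occupant's personalized parameters, then handle the remaining requests in order.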

[0013] Secondly, this application provides an in-vehicle voice personalized interaction device, comprising: The user information acquisition module is used to acquire the user's identity information and, based on the user's identity information, acquire the user's historical voice interaction records; wherein, the identity information is obtained based on the user's facial image information and voice information recognition; The interaction preference analysis module is used to generate personalized voice interaction parameters that match the user's interaction preferences based on identity information and historical voice interaction records; these parameters include speech rate, timbre, and pitch. The voice interaction execution module is used to output an interactive voice stream based on personalized voice interaction parameters and interactive content; wherein, the interactive content is generated based on at least one of the user's current voice data, historical voice interaction records, and preset system voice.

[0014] Thirdly, this application provides an electronic device including a memory and a processor. The memory is used to store computer programs or instructions that, when executed by the processor, implement the method described in the first aspect or any of the alternative solutions of the first aspect.

[0015] Fourthly, this application provides a computer-readable storage medium. The storage medium stores a computer program or instructions that, when executed by a processor, implement the method described in the first aspect or any of the alternative solutions to the first aspect.

[0016] The beneficial effects of the technical solutions provided in some embodiments of this specification include at least the following: This application employs multimodal identity recognition based on facial images and voice information to pre-identify the user when the user triggers voice interaction. After identity confirmation, it retrieves corresponding historical voice interaction records to generate personalized voice interaction parameters including speech rate, timbre, and pitch. Based on these personalized parameters and the interaction content, it outputs a customized voice stream. This application achieves dynamic perception of user interaction preferences and emotional states by outputting a customized voice stream based on personalized parameters and interaction content. Unlike the single and rigid interaction methods in existing technologies, this application avoids the problem of mismatch between response content and actual user needs when multiple people simultaneously initiate interaction requests in the cabin through multimodal identity recognition. At the same time, because historical interaction records are considered when determining personalized voice interaction parameters, interaction efficiency is improved, and the flexibility and humanization of the interaction are significantly enhanced.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of the in-vehicle voice personalized interaction method provided in an embodiment of this application;
Figure 2 is a schematic diagram of the structure of the in-vehicle voice personalized interaction device provided in an embodiment of this application;
Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0019] Embodiments of the present disclosure will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of the disclosure. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concepts of the present disclosure.

[0020] The accompanying drawings illustrate various structural schematics according to embodiments of the present disclosure. These drawings are not to scale, and some details have been enlarged for clarity, and some details may have been omitted. The shapes of the various regions and layers shown in the drawings, as well as their relative sizes and positional relationships, are merely exemplary and may deviate from reality due to manufacturing tolerances or technical limitations. Furthermore, those skilled in the art can design regions / layers with different shapes, sizes, and relative positions as needed.

[0021] In the context of this disclosure, when a layer / element is referred to as being "above" another layer / element, the layer / element may be directly above the other layer / element, or there may be an intermediate layer / element between them. Additionally, if a layer / element is "above" another layer / element in one orientation, then when the orientation is reversed, the layer / element may be "below" the other layer / element.

[0022] The following is a brief explanation of the terms used in this application:
OMS (Occupant Monitoring System): an in-vehicle monitoring device that collects users' facial images and posture information through visual sensors such as in-vehicle cameras.
Voiceprint recognition: a biometric technique that verifies or identifies a user's identity by extracting and analyzing voiceprint features from the user's voice.
Large Language Model Dialogue System: an in-vehicle intelligent voice interaction processing module built on Large Language Model (LLM) technology, possessing natural language understanding and generation capabilities.

[0023] Current mainstream solutions for in-vehicle voice interaction systems mostly employ single-channel biometric authentication (such as relying solely on voiceprint or facial vision) combined with a standardized voice interaction mechanism that is generic and indiscriminate. The inventors discovered that, due to the lack of a mapping relationship between user identity and historical behavioral data, the system lacks user memory, consistently treating each conversation as an initial interaction with a new user, making it difficult to instantly perceive the user's personalized preferences and emotional state. This rigid and fixed interaction format leads to a deviation between the feedback content and the true intent, resulting not only in low communication efficiency but also raising the usage threshold for special groups such as the elderly and children, creating a technical problem of poor interactive experience.

[0024] Based on the above-mentioned technical problems, the inventive concept of this application is to: lock the user's identity in advance by pre-setting a multimodal identity recognition mechanism before the user triggers voice interaction, and pre-load historical interaction records based on the identity to generate a hierarchical and progressive personalized voice service strategy.

[0025] The technical solutions of this application and how they solve the aforementioned technical problems will be described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.

[0026] It should be noted that the executor of this application may be, but is not limited to, in-vehicle computing platforms, intelligent cockpit domain controllers, or in-vehicle entertainment hosts, as well as in-vehicle voice interaction control devices that have the capabilities of multimodal identity fusion recognition, historical voice interaction record management, and personalized voice stream generation.

[0027] Refer to Figure 1, a flowchart of the in-vehicle voice personalized interaction method provided in an embodiment of this application. As shown in Figure 1, this in-vehicle voice personalized interaction method includes at least the following steps:
S101: Obtain the user's identity information and obtain the user's historical voice interaction records based on the user's identity information; wherein, the identity information is obtained based on the user's facial image information and voice information recognition.

[0028] Specifically, when a user triggers a voice interaction, the control device simultaneously calls the OMS to collect the user's facial image information and activates the voiceprint recognition module to collect the user's voice information. The facial features and voiceprint features are cross-verified through a multimodal data fusion algorithm to lock the user's identity. After the identity is confirmed, the device immediately retrieves the user's historical voice interaction records from the voice interaction record storage module, such as the content of previous voice commands, dialogue context, interaction time, and interaction preference data.

[0029] S103: Based on identity information and historical voice interaction records, generate personalized voice interaction parameters that match the user's interaction preferences; among which, the parameters include speech rate, timbre, and pitch.

[0030] Specifically, the control device analyzes the retrieved historical voice interaction records, extracts data such as the user's previously selected speech rate preferences, timbre preferences, and pitch characteristics, and combines them with information such as age and gender attributes in the identity information. Through rule mapping or large model inference, it generates a personalized voice interaction parameter combination containing specific speech rate, specific timbre, and specific pitch, forming a unique voice configuration scheme for the current user.
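As a minimal sketch of the rule-mapping option in step S103, the snippet below starts from defaults keyed by an age attribute and lets values recovered from history override them. The `VoiceParams` fields, the age groups, and every numeric value are assumptions made for illustration; the source does not specify units or concrete values.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class VoiceParams:
    speech_rate: float  # assumed: multiplier relative to the engine default
    timbre: str         # assumed: voice identifier understood by the engine
    pitch: float        # assumed: semitone offset from the engine default


# Hypothetical rule table keyed by an age attribute from the identity profile.
AGE_GROUP_DEFAULTS = {
    "child": VoiceParams(0.9, "lively", 2.0),
    "adult": VoiceParams(1.0, "neutral", 0.0),
    "senior": VoiceParams(0.8, "warm", 0.0),
}


def generate_params(age_group, history_prefs=None):
    """Start from the age-group default, then apply preferences found in history."""
    base = AGE_GROUP_DEFAULTS.get(age_group, AGE_GROUP_DEFAULTS["adult"])
    known = {f.name for f in fields(VoiceParams)}
    # Ignore unknown keys and unset (None) values so partial history is safe.
    overrides = {
        k: v for k, v in (history_prefs or {}).items()
        if k in known and v is not None
    }
    return replace(base, **overrides)
```

For example, `generate_params("senior", {"timbre": "calm"})` keeps the slower default rate for the senior group while honoring the timbre found in that user's history.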

[0031] S105: Output interactive voice stream based on personalized voice interaction parameters and interactive content; wherein, the interactive content is generated based on at least one of the user's current voice data, historical voice interaction records and preset system voice.

[0032] Specifically, after receiving the user's current voice data, the control device analyzes the contextual intent of the current command by combining it with the loaded historical voice interaction records. It then generates a natural language response as the interaction content through a large-scale dialogue system. Subsequently, it calls a speech synthesis engine to convert the interaction content into an interactive speech stream adapted to the user's auditory habits according to the generated personalized voice interaction parameters and outputs it. It should be noted that "contextual intent" refers to the logical dependency or referential relationship between the current command and historical interactions in a continuous dialogue (for example, if the previous sentence asks "What restaurants are nearby?" and the next sentence says "Navigate to the first one"—the "first one" in the second sentence depends on the historical records of the previous sentence to understand).
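The restaurant example can be made concrete with a small sketch. The source describes a large-model dialogue system for this step; the snippet below swaps in a simple pattern match purely to illustrate how a reference such as "the first one" depends on the stored history. The history record layout and the restaurant names are hypothetical.

```python
import re

ORDINALS = {"first": 0, "second": 1, "third": 2}


def resolve_reference(command, history):
    """Replace 'the <ordinal> one' with the matching item from the most
    recent history turn that returned a list of results."""
    match = re.search(r"\bthe (first|second|third) one\b", command, re.IGNORECASE)
    if not match:
        return command  # nothing to resolve
    index = ORDINALS[match.group(1).lower()]
    # Walk the history from newest to oldest.
    for turn in reversed(history):
        results = turn.get("results")
        if results and index < len(results):
            return command[:match.start()] + results[index] + command[match.end():]
    return command  # no usable history; leave the command unchanged


history = [
    {"query": "What restaurants are nearby?", "results": ["Noodle House", "Cafe Luna"]},
]
print(resolve_reference("Navigate to the first one", history))
# prints "Navigate to Noodle House"
```

Without the stored turn, the command is returned unchanged, which is the "contextual break" the paragraph above attributes to systems that keep no history.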

[0033] Understandably, the above three steps work together to form a complete technical closed loop of "identity-first locking - historical preloading - personalized response". First, the multimodal cross-validation mechanism in step S101 combines the spatial stability of facial recognition with the temporal continuity of voiceprint recognition. This significantly improves the accuracy of identity recognition compared to single-modality recognition and effectively solves the problem of identity misjudgment caused by in-vehicle noise or users' casual sitting posture. Then, the parameter generation based on historical records and identity attributes in step S103 transforms the generalized voice service into a configuration customized for each individual user, solving the problems of comprehension barriers caused by excessively fast speech for elderly users and attention distraction caused by monotonous timbre for children. Finally, in step S105, the personalized parameters are injected into the speech synthesis process, so that the output is not only semantically relevant interactive content, but also a voice stream whose acoustic form accurately matches the user's physiological characteristics and psychological expectations. Thus, a progressive enhancement is formed from the accuracy of identity verification, to the adaptability of service strategies, to the emotional expressiveness of the interactive output. This avoids service deviations caused by identity recognition errors, contextual breaks caused by a lack of historical memory, and barriers to use by special groups caused by universal interaction modes, thereby significantly improving the personalized service level of in-vehicle voice interaction and the interaction convenience for different user groups.

[0034] In some embodiments, the identity information is obtained based on the following identification method: Collect users' facial image and voice information, and extract the corresponding facial image features and voiceprint features; The first identity information is determined based on facial image features and a preset user database, and the second identity information is determined based on voiceprint features and a preset user database. If the first identity information matches the second identity information, the identification is considered successful and either the first or second identity information is used as the identity information.

[0035] Specifically, when the control device detects that the user has triggered a voice interaction command, it simultaneously activates the OMS camera to collect the user's facial image information and activates the microphone array to collect the user's voice information. Then, it can extract facial feature vectors from the facial image through a face detection algorithm, and extract voiceprint feature vectors from the sound signal through a voiceprint preprocessing and feature extraction algorithm, thus completing the parallel acquisition and characterization of dual-channel biometric data.

[0036] Subsequently, the control device compares the extracted facial image feature vector with the registered facial features stored in the preset user database for similarity, and determines the first identity information (i.e., the user ID corresponding to the facial recognition result) based on the similarity threshold; at the same time, it performs pattern matching between the extracted voiceprint feature vector and the registered voiceprint features stored in the preset user database, and determines the second identity information (i.e., the user ID corresponding to the voiceprint recognition result) based on the matching score, thus realizing independent identity determination based on different biometric channels.

[0037] Next, the control device performs a consistency check on the first identity information and the second identity information. If the two match (i.e., facial recognition and voiceprint recognition point to the same user), the multimodal cross-validation is deemed successful, and the first or second identity information is output as the final confirmed identity information. If the two do not match, a second collection is triggered or recognition is rejected to prevent identity mismatch caused by misidentification due to a single modality.
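The three steps above (independent matching per channel, then a consistency check) can be sketched as follows. This is an illustrative sketch under stated assumptions: cosine similarity, the 0.8 thresholds, and the dictionary-based template stores are not specified by the source, which only refers to a similarity threshold and a matching score.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(feature, templates, threshold):
    """Return the user ID whose template is most similar to `feature`,
    or None if no template reaches `threshold`."""
    best_id, best_score = None, threshold
    for user_id, template in templates.items():
        score = cosine_similarity(feature, template)
        if score >= best_score:
            best_id, best_score = user_id, score
    return best_id


def identify(face_feature, voice_feature, face_db, voice_db,
             face_threshold=0.8, voice_threshold=0.8):
    """Cross-validate the two channels; return a user ID only if both agree."""
    first_id = best_match(face_feature, face_db, face_threshold)     # first identity information
    second_id = best_match(voice_feature, voice_db, voice_threshold)  # second identity information
    if first_id is not None and first_id == second_id:
        return first_id
    return None  # mismatch or undetermined: caller may re-collect or reject
```

Returning `None` for both the mismatch case and the undetermined case leaves the follow-up policy (second collection, rejection, or registration) to the caller.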

[0038] Understandably, the above three steps constitute a multimodal fusion mechanism of "dual-channel acquisition - independent judgment - cross-validation": firstly, parallel acquisition avoids the time loss caused by serial processing; secondly, independent judgment preserves the complete recognition information of each modality; and finally, consistency verification achieves a multiplicative reduction in the false recognition rate (i.e., the product of the facial false recognition rate and the voiceprint false recognition rate). Since a single modality has inherent blind spots in complex in-vehicle environments (such as user's casual sitting posture causing facial occlusion or engine noise interfering with sound acquisition), this embodiment requires mutual verification of the two modal recognition results. Only when the face and voiceprint simultaneously point to the same identity is the recognition confirmed. This forces impersonation or misidentification to simultaneously overcome two biometric defenses, thus significantly improving the accuracy of identity recognition compared to a single modality. This effectively solves the problem of identity locking failure caused by in-vehicle environmental interference or the user's casual posture, providing a reliable identity foundation for subsequent personalized voice services.
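The multiplicative claim can be written out explicitly. The product form holds only under the assumption that the two channels make errors independently, which the source does not state; the same assumption also shows the cost of requiring both channels to agree, namely a higher false rejection rate.

```latex
\mathrm{FAR}_{\text{fused}} = \mathrm{FAR}_{\text{face}} \times \mathrm{FAR}_{\text{voice}},
\qquad
\mathrm{FRR}_{\text{fused}} = 1 - \bigl(1 - \mathrm{FRR}_{\text{face}}\bigr)\bigl(1 - \mathrm{FRR}_{\text{voice}}\bigr)
```

With hypothetical rates of 1% for the face channel and 2% for the voiceprint channel, the fused false acceptance rate would be 0.01 × 0.02 = 0.0002, or 0.02%. These numbers are illustrative only and are not taken from the application.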

[0039] In some embodiments, the method further includes: If the first identity information and the second identity information are inconsistent, the identification is considered to have failed. Add the current user's identity information to the preset user database and store the current user's voice interaction records.

[0040] Specifically, when the control device determines that the first identity information (determined based on facial image features) is inconsistent with the second identity information (determined based on voiceprint features), it triggers the security protection mechanism for recognition failure, refuses to output any single-modal recognition result as valid identity information, and marks the current user as unregistered, so as to avoid identity mismatch and subsequent service deviation caused by misidentification of a single modality (such as misidentifying another person's face or misrecording environmental noise voiceprint).

[0041] Furthermore, the control device automatically creates a new identity profile for the currently unregistered user in the preset user database, assigns a unique user identifier, and binds and stores the facial image features and voiceprint features collected this time with the identifier as a basic biometric template. At the same time, it initializes the user's voice interaction record storage space and saves the voice command content, dialogue context and interaction preference data of this and subsequent times in real time.
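A minimal sketch of this "register on recognition failure" flow is shown below, assuming an in-memory store. The `UserDatabase` class, its method names, and the use of `uuid4` for the unique identifier are hypothetical; the source does not describe the storage layout.

```python
import uuid


class UserDatabase:
    """In-memory stand-in for the preset user database (illustrative only)."""

    def __init__(self):
        self.profiles = {}  # user_id -> biometric templates
        self.records = {}   # user_id -> list of voice interaction records

    def register(self, face_feature, voice_feature):
        """Create a profile, bind both templates to a new unique ID,
        and initialize empty interaction-record storage."""
        user_id = uuid.uuid4().hex
        self.profiles[user_id] = {
            "face": list(face_feature),
            "voice": list(voice_feature),
        }
        self.records[user_id] = []
        return user_id

    def log_interaction(self, user_id, command, response):
        """Append one dialogue turn to the user's history."""
        self.records[user_id].append({"command": command, "response": response})


def resolve_user(db, identified_id, face_feature, voice_feature):
    """Return (user_id, is_new). Register a new profile when recognition failed."""
    if identified_id is not None:
        return identified_id, False
    return db.register(face_feature, voice_feature), True
```

Because the record list is created at registration time, the first conversation is already stored as history for the next session.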

[0042] Understandably, the aforementioned "automatic registration upon recognition failure" mechanism constructs a zero-threshold user experience closed loop of "file creation upon first use": when bimodal recognition is inconsistent, the system does not simply refuse service, but automatically completes new user registration based on the bimodal data collected this time. This allows users to start using voice interaction services without having to go through a complicated manual entry process to pre-register their voiceprint and facial information. At the same time, since the identity profile is established at the time of the first interaction, the voice interaction records generated from the first use are continuously accumulated, avoiding the personalized service gap caused by the lack of historical data for new users in the traditional model. Thus, while ensuring the security of identity recognition, it achieves the technical effect of users enjoying complete personalized services as soon as they get in the car, significantly reducing the initial usage threshold of the intelligent cockpit voice system.

[0043] In some embodiments, the method further includes: If the first identity information cannot be determined based on facial image features and the preset user database, or if the second identity information cannot be determined based on voiceprint features and the preset user database, then the recognition is considered to have failed. Add the current user's identity information to the preset user database and store the current user's voice interaction records.

[0044] Specifically, when the control device fails to match a registered face with sufficient similarity in the preset user database based on facial image features (e.g., the user did not register facial information when using the vehicle for the first time), or fails to match a registered voiceprint with sufficient confidence in the preset user database based on voiceprint features (e.g., the user's voiceprint temporarily changes due to a cold), it determines that the corresponding modality recognition has failed and triggers the recognition failure mechanism, refusing to provide personalized services based on historical records, in order to prevent identity theft or data interference caused by forced matching of a single modality.

[0045] Furthermore, the control device automatically binds the currently collected facial image features with voiceprint features, creates a new user identity profile in the preset user database and assigns a unique identifier, and initializes the user's exclusive voice interaction record storage space, saving the content of this conversation in real time as initial historical data, thus completing a seamless connection from identity registration to service activation.

[0046] Understandably, the above mechanism constructs a fault-tolerant and self-learning closed loop for scenarios where "single-modal or bimodal approaches cannot match existing profiles": by allowing automatic registration to establish identity profiles even when some biometric features are missing (e.g., the database does not have the user's facial or voiceprint template) or temporarily invalid (e.g., the user's voice is abnormal), it avoids service rejection caused by the rigid requirement of bimodal pre-registration; at the same time, by creating profiles in real time and storing interaction records, users can accumulate personalized data from the first use. As the number of uses increases, the historical voice interaction records become richer, and the personalized parameters generated by the large model will be continuously optimized with the accumulation of data, thereby achieving a progressive intelligent service effect of "usable from the first use and becoming more intelligent with each use", which lowers the system's entry threshold while ensuring the continuous evolution capability of personalized services.

[0047] In some embodiments, personalized voice interaction parameters that conform to the user's interaction preferences are generated based on identity information and historical voice interaction records, including: Based on identity information, historical preference data of users is extracted from historical voice interaction records; wherein, historical preference data includes historical settings records and / or preference statistics of users for at least one of speech rate, timbre, and pitch; Personalized voice interaction parameters are generated based on historical preference data.

[0048] Specifically, after confirming the user's identity, the control device indexes the corresponding historical data storage area based on the identity information and extracts historical preference data from the user's historical voice interaction records. The historical preference data includes both explicit settings records, such as the speech rate level, timbre type, and pitch mode that the user has actively selected in the system settings interface, and preference statistics that the control device infers by analyzing the user's behavioral characteristics in historical dialogues (such as the duration of dialogue at a specific speech rate and the frequency of repeated commands for different timbres).

[0049] Furthermore, based on the extracted historical preference data, the control device uses a rule mapping table or a machine learning model to convert the abstract preference information into specific executable parameters, generating personalized voice interaction parameters that include a specific speech rate value, a specific timbre identifier, and a specific pitch curve, which directly configure the output characteristics of the subsequent speech synthesis engine.
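
As a non-limiting illustration, the rule-mapping-table variant of the above step can be sketched as follows. All table contents, record fields, and names (`SPEECH_RATE_TABLE`, `TIMBRE_TABLE`, `PITCH_TABLE`, `generate_params`) are hypothetical and chosen only for this example. The rule that an explicit setting overrides an implicit statistic is also an assumption; the description only states that both kinds of data are used.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical rule mapping tables: abstract preference label -> concrete value.
SPEECH_RATE_TABLE = {"slow": 0.8, "normal": 1.0, "fast": 1.2}   # rate multiplier
TIMBRE_TABLE = {"neutral": "tts_voice_neutral", "warm": "tts_voice_warm",
                "lively": "tts_voice_lively"}                    # timbre identifier
PITCH_TABLE = {"flat": [1.0, 1.0, 1.0], "lively": [1.0, 1.15, 1.05]}  # pitch curve
DEFAULTS = {"speech_rate": "normal", "timbre": "neutral", "pitch": "flat"}


@dataclass
class VoiceParams:
    speech_rate: float
    timbre_id: str
    pitch_curve: list


def pick_preference(records, key):
    """Choose one abstract label for `key` from historical preference data."""
    explicit = [r["value"] for r in records
                if r["key"] == key and r["source"] == "explicit"]
    if explicit:
        return explicit[-1]            # latest explicit setting wins (assumption)
    implicit = [r["value"] for r in records
                if r["key"] == key and r["source"] == "implicit"]
    if implicit:
        return Counter(implicit).most_common(1)[0][0]   # most frequent habit
    return DEFAULTS[key]               # no history yet: fall back to defaults


def generate_params(records):
    """Map abstract preferences to executable speech synthesis parameters."""
    return VoiceParams(
        speech_rate=SPEECH_RATE_TABLE[pick_preference(records, "speech_rate")],
        timbre_id=TIMBRE_TABLE[pick_preference(records, "timbre")],
        pitch_curve=PITCH_TABLE[pick_preference(records, "pitch")],
    )
```

A machine learning model could replace `pick_preference` without changing the shape of the output, since the speech synthesis engine only consumes the resulting `VoiceParams`.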

[0050] Understandably, the aforementioned "historical data mining - automatic parameter generation" mechanism achieves a closed-loop mapping from user history to the current service strategy. Because the historical preference data includes both explicit preferences actively set by the user and implicit habits inferred by the system through behavioral analysis, the control device can capture the differentiated needs of different user groups (such as the elderly's preference for a slow speech rate and children's preference for a lively timbre), eliminating the need for users to manually adjust voice settings each time they get in the car. Traditional universal voice output gives a "one-size-fits-all" experience (for example, a fixed speech rate causes comprehension difficulties for elderly users with hearing loss, and a monotonous timbre lacks appeal for children). In contrast, this embodiment preloads historical preferences and automatically configures the output parameters, so that voice interaction proactively adapts to the user's physiological characteristics and psychological expectations at the acoustic level. This reduces user operating costs and improves the convenience and auditory comfort of interaction, which is especially valuable for special groups (the elderly and children).

[0051] In some embodiments, outputting an interactive voice stream based on the personalized voice interaction parameters and interactive content includes: obtaining the user's current voice data and identifying the user's intent; determining, based on the historical voice interaction records, the contextual information associated with the user's intent; generating the interactive content based on at least one of the user intent, the contextual information, and a preset system voice; adjusting the configuration parameters of the speech synthesis engine according to the personalized voice interaction parameters, and synthesizing the interactive voice stream based on the interactive content; and outputting the interactive voice stream.

[0052] Specifically, the control device collects the user's current voice data through an in-vehicle microphone array, performs noise reduction, feature extraction, and speech recognition processing on the audio signal, and then uses a natural language understanding model to analyze and recognize the user's current intent (such as specific interactive needs like setting navigation destination, controlling multimedia playback, or querying information).

[0053] Subsequently, the control device retrieves contextual information related to the current user's intent based on the acquired historical voice interaction records, including unfinished matters in previous dialogues, entity references mentioned by the user (such as "that restaurant" or "the first result" mentioned above), and clues to the continuation of the dialogue topic, in order to establish a logical connection between the current command and historical dialogues.
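
As a non-limiting illustration, resolving an entity reference such as "the first result" against earlier dialogue can be sketched as follows. The history record layout (a list of turns, each optionally carrying a `results` list) and the function name `resolve_reference` are assumptions made for this example, and the regular expression covers only one reference pattern; a practical system would rely on a language understanding model rather than pattern matching.

```python
import re

ORDINALS = {"first": 0, "second": 1, "third": 2}


def resolve_reference(utterance, history):
    """Resolve "the <ordinal> result" against the most recent turn with results.

    `history` is a list of dicts, oldest first; a turn that produced a
    result list stores it under the hypothetical key "results".
    Returns the referenced entity, or None if nothing can be resolved.
    """
    match = re.search(r"\bthe (first|second|third) result\b", utterance.lower())
    if match is None:
        return None                      # no reference to resolve
    index = ORDINALS[match.group(1)]
    for turn in reversed(history):       # search newest turns first
        results = turn.get("results", [])
        if results:
            return results[index] if index < len(results) else None
    return None
```

For example, with a history in which an earlier turn returned the results "Noodle House" and "Tea Garden", the command "Navigate to the first result" resolves to "Noodle House".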

[0054] Based on this, the control device uses at least one of the identified user intent, determined context information, and preset system voice script templates to generate natural language response text as interactive content through large model dialogue system reasoning, ensuring that the response is semantically consistent with the user's current needs and historical dialogues.

[0055] Subsequently, the control device adjusts the corresponding configuration parameters of the speech synthesis engine based on the generated personalized voice interaction parameters (including speech rate, timbre identifier, and pitch curve), and calls the speech synthesis engine to synthesize an interactive voice stream adapted to the user's auditory characteristics based on the generated interactive content.

[0056] Finally, the control device outputs the synthesized interactive voice stream to the user through the vehicle's speakers, completing this round of personalized voice interaction.
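
As a non-limiting illustration, one round of the interaction described in the preceding paragraphs can be sketched as a function that chains the five steps. The speech recognition, context retrieval, large model, and speech synthesis components are passed in as callables because this application does not name specific engines; the `toy_*` functions below are trivial stand-ins so that the sketch runs, and every name in the block is hypothetical. Appending the round to the history at the end is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VoiceParams:
    speech_rate: float
    timbre_id: str
    pitch_curve: list


def run_interaction_round(audio, history, params, *, recognize_intent,
                          find_context, generate_content, synthesize, play):
    intent = recognize_intent(audio)             # 1. capture current intent
    context = find_context(intent, history)      # 2. associate dialogue context
    content = generate_content(intent, context)  # 3. generate response text
    stream = synthesize(content, params)         # 4. apply personalized parameters
    play(stream)                                 # 5. output through the speakers
    history.append({"intent": intent, "response": content})
    return content


# Toy stand-ins; real ASR/NLU, large model, and TTS components would replace them.
def toy_recognize_intent(audio):
    return audio.strip().lower()                 # pretend the audio is already text


def toy_find_context(intent, history):
    return history[-1]["intent"] if history else None


def toy_generate_content(intent, context):
    if context is None:
        return f"OK: {intent}"
    return f"OK: {intent} (after {context})"


def toy_synthesize(content, params):
    return {"text": content, "rate": params.speech_rate,
            "voice": params.timbre_id, "pitch": params.pitch_curve}


played = []      # stands in for the vehicle speakers
history = []     # stands in for the user's historical voice interaction records
params = VoiceParams(0.8, "tts_voice_warm", [1.0, 1.0, 1.0])
reply = run_interaction_round(
    "Navigate home", history, params,
    recognize_intent=toy_recognize_intent, find_context=toy_find_context,
    generate_content=toy_generate_content, synthesize=toy_synthesize,
    play=played.append)
```

Passing the components as parameters keeps the personalization step (step 4) independent of how the response text is produced, which mirrors the separation between content generation and acoustic adaptation in the description.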

[0057] Understandably, the above five steps constitute a complete interactive closed loop of "real-time intent capture - contextual association - content generation - acoustic adaptation - voice output": first, real-time speech recognition captures the user's current needs; second, historical record retrieval associates the dialogue context (resolving the ambiguous references and semantic gaps caused by isolated interactions); third, a large model generates semantically coherent response content; fourth, the personalized parameters are injected into speech synthesis so that the acoustic characteristics of the synthesized speech match the user's physiological characteristics and preferences (such as the slow speech rate needed by the elderly and the lively tone preferred by children); and fifth, the synthesized voice stream is output to the user through the vehicle's speakers. Traditional solutions suffer from repeated confirmations due to a lack of contextual memory, and from auditory discomfort and comprehension barriers caused by generic voice output. In contrast, this embodiment achieves end-to-end personalization from semantic understanding to acoustic presentation, improving the naturalness, coherence, and comfort of the interaction, and in particular maintaining interaction continuity and experience quality for special groups in complex in-vehicle environments.

[0058] In some embodiments, outputting an interactive voice stream based on the personalized voice interaction parameters and interactive content includes: if multiple users are identified, determining the identity information and seat position of each user separately; determining user priority based on the identity information and seat position; and outputting the interactive voice stream based on the user priority, the personalized voice interaction parameters, and the interactive content.

[0059] Specifically, when the control device detects, through a distributed microphone array, that multiple users in the cabin are triggering voice interaction simultaneously, it identifies each sound source independently. The OMS camera is used in combination with the cabin camera layout (different cameras cover different seat areas) to determine each user's facial image information and corresponding seat position, while voiceprint recognition is used to determine each user's identity information, thereby establishing a mapping relationship between user identity and seat position.

[0060] Then, the control device determines the user priority sequence by applying preset priority rules (such as the driver's seat priority principle based on driving safety, the elder's identity priority principle based on social etiquette, or the first wake-up priority principle based on time sequence) to the identity information (age, role) and seat position (driving relevance) of each user. Based on this priority order, it sequentially or selectively calls the personalized voice interaction parameters and interactive content of each user, and generates and outputs the corresponding interactive voice stream.
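
As a non-limiting illustration, the priority calculation can be sketched as a sort over a composite key. The three example rules named above are combined here in a fixed lexicographic order (driver's seat first, then elders and children, then earliest wake-up); this ordering, the seat and role labels, and the names `Occupant` and `order_by_priority` are assumptions for this example, since the description leaves the combination of rules to the preset configuration.

```python
from dataclasses import dataclass


@dataclass
class Occupant:
    user_id: str
    seat: str          # e.g. "driver", "front_passenger", "rear_left"
    role: str          # e.g. "adult", "elder", "child"
    wake_time: float   # time at which this occupant triggered the interaction


def priority_key(occupant):
    """Smaller tuples sort first, i.e. are served first."""
    seat_rank = 0 if occupant.seat == "driver" else 1            # driving safety
    role_rank = 0 if occupant.role in ("elder", "child") else 1  # etiquette
    return (seat_rank, role_rank, occupant.wake_time)            # then first wake-up


def order_by_priority(occupants):
    """Turn concurrent requests into an ordered response sequence."""
    return sorted(occupants, key=priority_key)
```

The control device would then iterate over the returned sequence and, for each occupant, load that occupant's personalized parameters before synthesizing the response, so that only one voice stream is output at a time.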

[0061] Understandably, the aforementioned mechanism constructs a three-dimensional decision-making model of "identity-location-priority" for concurrent multi-occupant interaction scenarios in smart cockpits. First, by combining OMS facial capture with the spatial layout of the cockpit cameras, abstract identity information is bound to specific physical seats, enabling the control device to distinguish the interaction needs of users in different seats. Second, priority calculation transforms the disordered concurrent requests of multiple users into an ordered response sequence, avoiding the auditory confusion and command conflicts caused by simultaneous output of multiple voices. Commands from the driver's seat often involve driving safety (such as navigation changes and air conditioning adjustments that require immediate response), while in family travel scenarios elders or children may require priority service. Through the coupled calculation of seat position and identity attributes, this embodiment therefore accommodates both driving safety (driver priority) and social etiquette (elders / children priority), while time-sharing / zone-sharing output based on personalized parameters avoids mutual interference between different users' voice preferences, thereby improving the orderliness, safety, and user experience of in-vehicle voice interaction in multi-user scenarios.

[0062] Based on the same technical concept, this application also provides an in-vehicle voice personalized interaction device. Referring to Figure 2, which is a schematic diagram of the structure of the in-vehicle voice personalized interaction device provided in an embodiment of this application, the device includes: a user information acquisition module 201, used to acquire the user's identity information and to acquire the user's historical voice interaction records based on the user's identity information, wherein the identity information is obtained by recognition based on the user's facial image information and voice information; an interaction preference analysis module 202, used to generate personalized voice interaction parameters that match the user's interaction preferences based on the identity information and the historical voice interaction records, wherein the parameters include speech rate, timbre, and pitch; and a voice interaction execution module 203, used to output an interactive voice stream based on the personalized voice interaction parameters and interactive content, wherein the interactive content is generated based on at least one of the user's current voice data, the historical voice interaction records, and a preset system voice.

[0063] It should be noted that this in-vehicle voice personalized interaction device can be used to implement any of the above method embodiments.

[0064] Based on the same technical concept, this application also provides an electronic device. Referring to Figure 3, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application, the electronic device includes a memory 301 and a processor 302. The memory 301 is used to store computer instructions; when the processor 302 executes the computer instructions, it implements any of the above-described method embodiments.

[0065] The specific entity of the electronic device can be a smartphone, wearable device, tablet computer, personal computer, in-vehicle terminal, game console, virtual device, workbench, digital assistant, set-top box, robot, or other terminal device. In other embodiments, it can be a rack server, blade server, tower server, or cabinet server (including a standalone server or a server cluster composed of multiple servers).

[0066] The memory 301 includes at least one type of computer-readable storage medium, including flash memory, hard disk, multimedia card, random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), magnetic disk, optical disk, etc. In some embodiments, the computer-readable storage medium may be an internal storage unit of an electronic device, such as the hard disk or memory of the electronic device. In other embodiments, the computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, secure digital card (SD card), flash memory card, etc., equipped on the electronic device. Of course, the computer-readable storage medium may include both internal storage units and external storage devices of the electronic device. In this embodiment, the computer-readable storage medium is typically used to store the operating system and various application software installed on the electronic device, such as the program code of the in-vehicle voice personalized interaction method in this embodiment. In addition, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or will be output.

[0067] In some embodiments, processor 302 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other chip. Processor 302 is typically used to control the overall operation of the processing device, such as performing control and processing related to data interaction or communication with other entities. In this embodiment, processor 302 is used to run program code stored in memory 301 or process data.

[0068] Based on the same technical concept, this application also provides a computer-readable storage medium, which includes a computer program or instructions stored in the storage medium. When the computer program or instructions are executed by a processing device, they implement any of the above-described method embodiments. Further details can be found in the method embodiments, which will not be repeated here. In this embodiment, the computer-readable storage medium includes flash memory, hard disk, multimedia card, random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), magnetic disk, optical disk, etc. In some embodiments, the computer-readable storage medium can be an internal storage unit of an electronic device, such as the hard disk or memory of the electronic device. In other embodiments, the computer-readable storage medium can also be an external storage device of the electronic device, such as a plug-in hard disk, secure digital card (SD card), flash memory card, etc., equipped on the electronic device. Of course, the computer-readable storage medium can also include both internal storage units and external storage devices of the electronic device. In this embodiment, the computer-readable storage medium is typically used to store the operating system and various application software installed on the electronic device, such as the program code of the in-vehicle voice personalized interaction method in the embodiment. Furthermore, the computer-readable storage medium can also be used to temporarily store various types of data that have been output or will be output.

[0069] The above description does not provide detailed technical specifications regarding the structure of each layer. However, those skilled in the art should understand that layers and regions of desired shapes can be formed using various technical means. Furthermore, to form the same structure, those skilled in the art can also design methods that are not entirely identical to those described above. Additionally, although various embodiments have been described above, this does not mean that the measures in the various embodiments cannot be advantageously combined.

[0070] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.

[0071] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. A method for personalized voice interaction in vehicles, characterized in that it comprises: obtaining a user's identity information and, based on the user's identity information, obtaining the user's historical voice interaction records, wherein the identity information is obtained by recognition based on the user's facial image information and voice information; generating, based on the identity information and the historical voice interaction records, personalized voice interaction parameters that match the user's interaction preferences, wherein the parameters include speech rate, timbre, and pitch; and outputting an interactive voice stream based on the personalized voice interaction parameters and interactive content, wherein the interactive content is generated based on at least one of the user's current voice data, the historical voice interaction records, and a preset system voice.

2. The method according to claim 1, characterized in that the identity information is obtained as follows: collecting the user's facial image information and voice information, and extracting the corresponding facial image features and voiceprint features; determining first identity information based on the facial image features and a preset user database, and determining second identity information based on the voiceprint features and the preset user database; and if the first identity information matches the second identity information, determining that the identification is successful and using either the first identity information or the second identity information as the identity information.

3. The method according to claim 2, characterized in that the method further comprises: if the first identity information does not match the second identity information, determining that the identification has failed, adding the current user's identity information to the preset user database, and storing the current user's voice interaction records.

4. The method according to claim 2, characterized in that the method further comprises: if the first identity information cannot be determined based on the facial image features and the preset user database, or if the second identity information cannot be determined based on the voiceprint features and the preset user database, determining that the identification has failed, adding the current user's identity information to the preset user database, and storing the current user's voice interaction records.

5. The method according to claim 1, characterized in that generating the personalized voice interaction parameters that match the user's interaction preferences based on the identity information and the historical voice interaction records comprises: extracting, based on the identity information, the user's historical preference data from the historical voice interaction records, wherein the historical preference data includes the user's historical settings records and / or preference statistics for at least one of the speech rate, timbre, and pitch; and generating the personalized voice interaction parameters based on the historical preference data.

6. The method according to claim 1, characterized in that outputting the interactive voice stream based on the personalized voice interaction parameters and interactive content comprises: obtaining the user's current voice data and identifying the user's intent; determining, based on the historical voice interaction records, the context information associated with the user's intent; generating the interactive content based on at least one of the user intent, the context information, and the preset system voice; adjusting the configuration parameters of a speech synthesis engine according to the personalized voice interaction parameters, and synthesizing the interactive voice stream based on the interactive content; and outputting the interactive voice stream.

7. The method according to claim 1, characterized in that outputting the interactive voice stream based on the personalized voice interaction parameters and interactive content comprises: if multiple users are identified, determining the identity information and seat position of each user separately; determining user priority based on the identity information and seat position; and outputting the interactive voice stream based on the user priority, the personalized voice interaction parameters, and the interactive content.

8. A vehicle-mounted personalized voice interaction device, characterized in that it comprises: a user information acquisition module, used to acquire a user's identity information and, based on the user's identity information, acquire the user's historical voice interaction records, wherein the identity information is obtained by recognition based on the user's facial image information and voice information; an interaction preference analysis module, used to generate personalized voice interaction parameters that match the user's interaction preferences based on the identity information and the historical voice interaction records, wherein the parameters include speech rate, timbre, and pitch; and a voice interaction execution module, used to output an interactive voice stream based on the personalized voice interaction parameters and interactive content, wherein the interactive content is generated based on at least one of the user's current voice data, the historical voice interaction records, and a preset system voice.

9. An electronic device, characterized in that it includes a memory and a processor, the memory being used to store computer programs or instructions; when the computer programs or instructions are executed by the processor, the method of any one of claims 1-7 is implemented.

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program or instructions which, when executed by a processor, implement the method of any one of claims 1-7.