Voice dialogue method and device, terminal device and storage medium
By acquiring user status information and using dialogue models to generate personalized dialogue content and voiceprint features, the problem of AI chat programs requiring users to actively wake them up has been solved, enabling proactive companionship and improved human-computer interaction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGDONG XIAOTIANCAI TECH CO LTD
- Filing Date
- 2023-08-09
- Publication Date
- 2026-06-23
AI Technical Summary
Existing AI chat programs require users to actively wake them up, and cannot proactively provide companionship, resulting in poor companionship effectiveness.
By acquiring user status information, using a dialogue model to generate personalized dialogue content, and generating dialogue speech based on voiceprint features, the system can proactively initiate voice dialogues.
It improves the companionship effect of AI voice chat, enhances the human-computer interaction of terminal devices, and provides a more intimate and comfortable voice dialogue experience.
Figure CN119479661B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of human-computer interaction technology, and in particular to a voice dialogue method, device, terminal equipment and storage medium.
Background Technology
[0002] Currently, AI (Artificial Intelligence) chat technology is constantly developing, allowing users to engage in conversations with AI chat programs, which can provide companionship, especially for the elderly. However, current AI chat programs still require active user activation; for example, users must actively open the software and type in content to activate it, or activate it via voice, before the AI chat program will provide feedback, resulting in a relatively poor level of companionship.
Summary of the Invention
[0003] This application discloses a voice dialogue method, apparatus, terminal device, and storage medium, which can enhance the companionship effect provided by AI voice chat to users and improve the human-computer interaction of terminal devices.
[0004] This application discloses a voice dialogue method applied to a first terminal device, the method comprising:
[0005] Obtain user status information, which includes first status information corresponding to a first user of the first terminal device and / or second status information corresponding to a second user, wherein the second user and the first user are relatives or friends;
[0006] The dialogue model generates dialogue content based on the first state information and / or the second state information.
[0007] The voiceprint features of the target user are obtained, a dialogue voice is generated based on the voiceprint features and the dialogue content, and the dialogue voice is played; wherein the target user includes the second user, or another relative or friend of the first user.
[0008] In one embodiment, generating dialogue content through a dialogue model based on the first state information and / or the second state information includes:
[0009] If the first state information is detected to meet the first preset condition, then dialogue content is generated by the dialogue model based on the first state information, or by the dialogue model based on the first state information and the second state information.
[0010] The first status information includes the first scene information of the first user, and the first preset condition includes: the first scene information matches any one of a plurality of first preset scene information; and / or,
[0011] The first state information includes the first user's emotional information, and the first preset condition includes: the first user's emotional information matches any one of a plurality of preset emotional information.
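The first preset condition can be illustrated with a small predicate. This is a minimal sketch, not the applicant's implementation: the `FirstStatus` type, the preset values, and the reading of "and / or" as "either match is sufficient" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class FirstStatus:
    scene: Optional[str] = None    # first scene information, if available
    emotion: Optional[str] = None  # first user's emotion information, if available


# Hypothetical preset values; the source does not enumerate them.
FIRST_PRESET_SCENES: FrozenSet[str] = frozenset({"walking_in_park", "sleeping_at_home"})
PRESET_EMOTIONS: FrozenSet[str] = frozenset({"sad", "lonely"})


def meets_first_preset_condition(status: FirstStatus) -> bool:
    """Return True if the scene or the emotion matches any of its preset values."""
    scene_match = status.scene is not None and status.scene in FIRST_PRESET_SCENES
    emotion_match = status.emotion is not None and status.emotion in PRESET_EMOTIONS
    return scene_match or emotion_match
```

A device would call `meets_first_preset_condition` on each newly acquired status and only invoke the dialogue model when it returns `True`.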
[0012] In one embodiment, generating dialogue content through a dialogue model based on the first state information and / or the second state information includes:
[0013] If the second status information is received, dialogue content is generated through the dialogue model based on the second status information, or through the dialogue model based on the first status information and the second status information; wherein, the second status information is sent by the second terminal device when it detects that the second status information meets the second preset condition;
[0014] The second status information includes the second scene information of the second user, and the second preset condition of the second user includes: the second scene information matches any one of a plurality of second preset scene information.
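On the sending side, the second terminal device only transmits the second status information when the second preset condition holds. The sketch below shows that gate under assumed names: `StatusInfo`, `SECOND_PRESET_SCENES`, and the `send` callback are hypothetical and stand in for whatever transport the device uses.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class StatusInfo:
    user_id: str
    scene: str  # second scene information, e.g. "home_from_school"


# Hypothetical preset scenes configured on the second terminal device.
SECOND_PRESET_SCENES: FrozenSet[str] = frozenset({"home_from_school", "playing_soccer"})


def maybe_send_second_status(
    status: StatusInfo,
    send: Callable[[StatusInfo], None],
    preset_scenes: FrozenSet[str] = SECOND_PRESET_SCENES,
) -> bool:
    """Send the status only if its scene matches any preset scene.

    Returns True when the status was sent, False when it was withheld.
    """
    if status.scene in preset_scenes:
        send(status)
        return True
    return False
```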
[0015] In one embodiment, generating dialogue content based on the first state information and / or the second state information through the dialogue model includes:
[0016] Based on the first status information and / or the second status information, determine the target user from among multiple relatives and friends of the first user;
[0017] The first state information and / or the second state information, as well as the user information of the target user, are used as input information for the dialogue model;
[0018] The input information is input into the dialogue model, and the dialogue model obtains the dialogue features of the target user based on the user information. Based on the dialogue features, and according to the first state information and / or the second state information, dialogue content corresponding to the target user is generated.
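One way to picture the input information is as a bundle of the available status information plus the target user's information, rendered into a prompt for a text-based dialogue model. The source does not say how the model consumes its input, so the prompt format, the `TargetUserInfo` fields, and both function names below are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TargetUserInfo:
    name: str
    relation: str         # relation to the first user, e.g. "grandchild"
    form_of_address: str  # how this person addresses the first user
    speech_habits: str    # short description of expression habits


def build_input_information(
    target: TargetUserInfo,
    first_status: Optional[str] = None,
    second_status: Optional[str] = None,
) -> Dict[str, object]:
    """Bundle the available status information with the target user's info."""
    if first_status is None and second_status is None:
        raise ValueError("first_status and/or second_status is required")
    info: Dict[str, object] = {"target_user": target}
    if first_status is not None:
        info["first_status"] = first_status
    if second_status is not None:
        info["second_status"] = second_status
    return info


def render_prompt(info: Dict[str, object]) -> str:
    """Turn the input information into a text prompt (assumed model interface)."""
    target = info["target_user"]
    lines = [
        f"Speak as {target.name}, the listener's {target.relation}.",
        f"Address the listener as '{target.form_of_address}'.",
        f"Speaking style: {target.speech_habits}.",
    ]
    if "first_status" in info:
        lines.append(f"First user status: {info['first_status']}")
    if "second_status" in info:
        lines.append(f"Second user status: {info['second_status']}")
    return "\n".join(lines)
```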
[0019] In one embodiment, after taking the first state information and / or the second state information, and the user information of the target user as input information to the dialogue model, the method further includes:
[0020] If the input information is the same as the historical input information, then the historical information corresponding to the historical input information is obtained. The historical information includes the historical input information, the historical dialogue content corresponding to the historical input information, and the historical reply content of the first user.
[0021] The step of inputting the input information into a dialogue model, obtaining the dialogue features of the target user based on the user information through the dialogue model, and generating dialogue content corresponding to the target user based on the dialogue features and according to the first state information and / or the second state information includes:
[0022] The input information and the historical information are input into the dialogue model, and the dialogue model obtains the dialogue features of the target user based on the user information. Based on the dialogue features, the dialogue content corresponding to the target user is generated according to the first state information and / or the second state information, as well as the historical information.
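The history lookup in this embodiment can be sketched as a store keyed by input information. The sketch assumes that "the same as" means exact equality of a hashable key; `HistoryRecord`, `HistoryStore`, and `model_inputs` are illustrative names, not terms from the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

InputKey = Tuple[str, ...]  # assumed hashable form of the input information


@dataclass(frozen=True)
class HistoryRecord:
    input_information: InputKey
    dialogue_content: str
    reply_content: str


class HistoryStore:
    """In-memory history, grouped by identical input information."""

    def __init__(self) -> None:
        self._records: Dict[InputKey, List[HistoryRecord]] = {}

    def save(self, record: HistoryRecord) -> None:
        self._records.setdefault(record.input_information, []).append(record)

    def lookup(self, input_information: InputKey) -> List[HistoryRecord]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._records.get(input_information, []))


def model_inputs(
    input_information: InputKey, store: HistoryStore
) -> Tuple[InputKey, List[HistoryRecord]]:
    """Pair the input with any history recorded for the same input.

    The history list is empty when this input has not been seen before.
    """
    return input_information, store.lookup(input_information)
```

Passing the earlier dialogue and reply alongside a repeated input gives the model a chance to avoid saying the same thing again, which appears to be the purpose of this embodiment.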
[0023] In one embodiment, after playing the dialogue voice, the method further includes:
[0024] The first user's reply voice in response to the dialogue voice is detected;
[0025] The content of the reply voice is recognized to determine the reply content corresponding to the reply voice;
[0026] The reply content, the dialogue content, and the input information are saved as historical information.
[0027] In one embodiment, after performing content recognition on the reply voice to determine the reply content corresponding to the reply voice, the method further includes:
[0028] The reply content and the dialogue content are sent to the target user's terminal device.
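The reply-handling steps (recognize the reply voice, save the history, forward the exchange to the target user's terminal device) can be sketched as one function. The speech recognizer and the transport are passed in as callables because the source does not name them; `handle_reply` and its parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class HistoryRecord:
    input_information: str
    dialogue_content: str
    reply_content: str


def handle_reply(
    reply_audio: bytes,
    input_information: str,
    dialogue_content: str,
    recognize: Callable[[bytes], str],          # assumed speech-to-text function
    history: List[HistoryRecord],
    send_to_target: Callable[[str, str], None],  # assumed transport to the target's device
) -> HistoryRecord:
    """Recognize the reply, store the exchange, and forward it to the target user."""
    reply_content = recognize(reply_audio)
    record = HistoryRecord(input_information, dialogue_content, reply_content)
    history.append(record)
    send_to_target(reply_content, dialogue_content)
    return record
```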
[0029] This application discloses a voice dialogue device applied to a first terminal device, the device comprising:
[0030] The information acquisition module is used to acquire user status information, which includes first status information corresponding to a first user of the first terminal device and / or second status information corresponding to a second user, wherein the second user and the first user are relatives or friends.
[0031] The content generation module is used to generate dialogue content based on the first state information and / or the second state information through the dialogue model;
[0032] The voice playback module is used to acquire the voiceprint features of the target user, generate dialogue voice based on the voiceprint features and the dialogue content, and play the dialogue voice; wherein the target user includes the second user, or another relative or friend of the first user.
[0033] In one embodiment, the content generation module is further configured to, if it is detected that the first state information satisfies a first preset condition, generate dialogue content through a dialogue model based on the first state information, or through the dialogue model based on the first state information and the second state information; the first state information includes first scene information of the first user, and the first preset condition includes: the first scene information matches any one of a plurality of first preset scene information; and / or, the first state information includes the first user's emotion information, and the first preset condition includes: the first user's emotion information matches any one of a plurality of preset emotion information.
[0034] In one embodiment, the content generation module is further configured to, if the second status information is received, generate dialogue content through the dialogue model based on the second status information, or through the dialogue model based on the first status information and the second status information; wherein the second status information is sent by the second terminal device when it detects that the second status information meets a second preset condition; the second status information includes the second scene information of the second user, and the second preset condition of the second user includes: the second scene information matches any one of a plurality of second preset scene information.
[0035] In one embodiment, the content generation module is further configured to: determine a target user from among multiple relatives and friends of the first user, based on the first state information and / or the second state information; use the first state information and / or the second state information, as well as the user information of the target user, as input information to a dialogue model; input the input information into the dialogue model, obtain the dialogue features of the target user based on the user information through the dialogue model, and generate dialogue content corresponding to the target user based on the dialogue features and the first state information and / or the second state information.
[0036] In one embodiment, the content generation module is further configured to, if the input information is the same as historical input information, obtain historical information corresponding to the historical input information, the historical information including the historical input information, historical dialogue content corresponding to the historical input information, and historical reply content of the first user; the step of inputting the input information into the dialogue model, obtaining the dialogue features of the target user through the dialogue model based on the user information, and generating dialogue content corresponding to the target user based on the dialogue features, the first state information, and / or the second state information, includes: inputting the input information and the historical information into the dialogue model, obtaining the dialogue features of the target user through the dialogue model based on the user information, and generating dialogue content corresponding to the target user based on the dialogue features, the first state information, and / or the second state information, and the historical information.
[0037] In one embodiment, the voice dialogue device further includes an information storage module, configured to detect the first user's reply voice in response to the dialogue voice; perform content recognition on the reply voice to determine the reply content corresponding to the reply voice; and save the reply content, the dialogue content, and the input information as historical information.
[0038] In one embodiment, the voice dialogue device further includes an information sending module for sending the reply content and the dialogue content to the target user's terminal device.
[0039] This application discloses a terminal device, including:
[0040] Memory containing executable program code;
[0041] A processor coupled to the memory;
[0042] The processor calls the executable program code stored in the memory to execute the method described in any of the above embodiments.
[0043] This application discloses a computer-readable storage medium storing a computer program, wherein when executed by a processor, the computer program causes the processor to perform the methods described in any of the above embodiments.
[0044] Through the voice dialogue method, apparatus, terminal device, and storage medium disclosed in the embodiments of this application, the first terminal device can obtain user status information. This user status information may include first status information corresponding to a first user of the first terminal device, and / or second status information corresponding to a second user who is a relative or friend of the first user. The first terminal device can generate dialogue content based on the first status information and / or the second status information using a dialogue model. The first terminal device then obtains the voiceprint features of the target user, generates dialogue voice based on the voiceprint features and the dialogue content, and plays the dialogue voice. The target user may include the second user, or other relatives or friends of the first user. Implementing this embodiment, the first terminal device can generate dialogue content based on the user status information using a dialogue model, and generate dialogue voice based on the target user's voiceprint information and the dialogue content. This personalized dialogue content and voice can provide the first user with a more intimate and comfortable voice dialogue experience. Moreover, the first terminal device can proactively initiate voice dialogue based on the user status information, thereby enhancing the companionship effect provided by AI voice chat and improving the human-computer interaction of the terminal device.
Attached Figure Description
[0045] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0046] Figure 1 is a schematic diagram illustrating an application scenario of a voice dialogue method disclosed in an embodiment of this application;
[0047] Figure 2 is a flowchart illustrating a voice dialogue method disclosed in an embodiment of this application;
[0048] Figure 3 is a flowchart illustrating another voice dialogue method disclosed in an embodiment of this application;
[0049] Figure 4 is a flowchart illustrating another voice dialogue method disclosed in an embodiment of this application;
[0050] Figure 5 is a flowchart illustrating another voice dialogue method disclosed in an embodiment of this application;
[0051] Figure 6 is a modular schematic diagram of a voice dialogue device disclosed in an embodiment of this application;
[0052] Figure 7 is a structural block diagram of a terminal device disclosed in an embodiment of this application.
Detailed Implementation
[0053] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0054] It should be noted that the terms "comprising" and "having" and any variations thereof in the embodiments of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to these processes, methods, products, or devices.
[0055] It is understood that the terms "first," "second," etc., used in this application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of this application, a first terminal device may be referred to as a second terminal device, and similarly, a second terminal device may be referred to as a first terminal device. Both the first terminal device and the second terminal device are terminal devices, but they are not the same terminal device.
[0056] This application discloses a voice dialogue method, apparatus, terminal device, and storage medium, which can enhance the companionship effect provided by AI voice chat to users and improve the human-computer interaction of terminal devices.
[0057] The following will be described in detail with reference to the accompanying drawings.
[0058] Figure 1 is a schematic diagram illustrating an application scenario of a voice dialogue method disclosed in an embodiment of this application. As shown in Figure 1, the application scenario may include a first terminal device 110, a first user 120, a second terminal device 130, and a second user 140. The first terminal device 110 and the second terminal device 130 may include, but are not limited to, mobile phones, tablets, wearable devices, laptops, and PCs (Personal Computers). The first terminal device 110 can detect the first state information of the first user 120, and the second terminal device 130 can detect the second state information of the second user 140. The first user 120 and the second user 140 may be relatives or friends; for example, the first user 120 in Figure 1 can be the grandmother of the second user 140. It is understood that, although Figure 1 shows only one second user 140 who has a close relationship with the first user 120, there may be one or more such users, such as the first user 120's son, daughter, friends, or other relatives or friends, each of whom can also be a second user; Figure 1 is just one example. Each relative or friend can correspond to a second terminal device 130, that is, there may be one or more second terminal devices 130.
[0059] The first terminal device 110 may include a dialogue model and a voice output device. The dialogue model is an AI chat program. The first terminal device 110 can generate dialogue content through the dialogue model and then play the dialogue voice through the voice output device, thereby engaging in voice dialogue with the first user 120 and providing companionship to the first user 120. Optionally, the first terminal device 110 and the second terminal device 130 may have a communication connection with each other, or may each have a communication connection with the same server. Accordingly, the second terminal device 130 can send the second status information corresponding to the second user 140 to the first terminal device 110, or it can send the second status information to the server, which then forwards it to the first terminal device 110.
[0060] In one embodiment, the first terminal device 110 can acquire user status information, which may include first status information corresponding to the first user 120 and / or second status information corresponding to the second user 140. The first terminal device 110 can generate dialogue content based on the first and / or second status information using a dialogue model. The first terminal device 110 then acquires the voiceprint features of the target user, generates dialogue speech based on the voiceprint features and dialogue content, and plays the dialogue speech. The target user includes the second user 140, or other relatives or friends of the first user 120. As an example, when the target user is the second user 140, the first terminal device 110 can acquire the voiceprint features of the second user 140, generate dialogue speech based on the voiceprint features of the second user 140 and the dialogue content, and play the dialogue speech. For example, as shown in Figure 1, the first terminal device 110 can use the voiceprint features of the second user 140 to play the dialogue voice "Grandma, I'm out of school, what are you doing?"
[0061] Figure 2 is a flowchart illustrating a voice dialogue method disclosed in an embodiment of this application. As shown in Figure 2, this voice dialogue method can be applied to the first terminal device described above, and the method may include the following steps:
[0062] Step 210: Obtain user status information.
[0063] Understandably, in order to provide companionship to the first user, the first terminal device can detect the first user's first state information, and the second terminal device of the second user can also detect the second user's second state information. The second user can refer to a friend or relative of the first user. The second terminal device can send the second state information to the first terminal device. The first terminal device can obtain user state information, which may include the first state information detected by the first terminal device corresponding to the first user, and / or the second state information received by the first terminal device from the second terminal device corresponding to the second user.
[0064] The status information can include the user's current scene and emotional state, without limitation. Scene information can represent a specific activity the user is currently performing in a specific location, such as playing football, running in a park, or sleeping at home. Emotional information can include feelings of happiness or sadness, without limitation. Understandably, the specific content of the status information can be determined based on the permissions granted to the first or second terminal device. Generally, the more permissions granted to a terminal device, the more content the status information can include, making the voice dialogue more realistic.
[0065] Optionally, the first terminal device can determine the first user's first state information by collecting first user data from its sensors, and the second terminal device can determine the second user's second state information by collecting second user data from its sensors. It is understood that the first user, as a user requiring companionship, may have more data than the second user. Optionally, the second user data may include necessary and unnecessary data. Necessary data refers to data required to determine the second state information, while unnecessary data refers to data that can be omitted. Necessary and unnecessary data can be pre-set. The second terminal device can be equipped with a switch for collecting unnecessary data. When the switch is on, the second terminal device can collect both necessary and unnecessary data; when the switch is off, the second terminal device only collects necessary data and not unnecessary data. This method can ensure the accuracy of the dialogue content generated based on user state information while protecting the second user's privacy.
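The collection switch described above can be sketched as a filter over the raw sensor data. Which fields count as necessary is pre-set in the source but not listed, so `NECESSARY_FIELDS` and the field names below are assumptions.

```python
from typing import Dict, FrozenSet

# Hypothetical split: fields required to determine the second state information.
NECESSARY_FIELDS: FrozenSet[str] = frozenset({"location", "motion"})


def collect_second_user_data(
    raw: Dict[str, object], collect_unnecessary: bool
) -> Dict[str, object]:
    """Return the data the second terminal device is allowed to collect.

    With the switch on, necessary and unnecessary data are both kept;
    with the switch off, only the necessary fields are kept.
    """
    if collect_unnecessary:
        return dict(raw)
    return {key: value for key, value in raw.items() if key in NECESSARY_FIELDS}
```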
[0066] Step 220: Generate dialogue content using the dialogue model based on the first state information and / or the second state information.
[0067] The first terminal device inputs the acquired first state information and / or second state information into the dialogue model. The dialogue model can generate dialogue content based on the first state information and / or second state information. The dialogue model can be a neural network model, including but not limited to convolutional neural networks, recurrent neural networks, Transformers, and other NLP (Natural Language Processing) models. This application embodiment does not limit the type of dialogue model.
[0068] The dialogue model can generate corresponding content based on the input user state information. The dialogue model can be trained on a text dataset. Furthermore, the dialogue model can store the dialogue features and voiceprint features corresponding to multiple relatives and friends of the first user. The dialogue features are used to represent the language expression habits of relatives and friends, while the voiceprint features refer to the voice features of relatives and friends.
[0069] Optionally, the dialogue model can generate dialogue content based on the first state information. For example, if the first user is the second user's grandmother, and the first state information includes scene information representing the first user taking a walk in the park, then the dialogue model can generate the dialogue content "Grandma, remember to drink water after your walk" based on this first state information. Optionally, the dialogue model can generate dialogue content based on the second state information. For example, if the first user is the second user's grandmother, and the second state information includes scene information representing the second user returning home from school, then the dialogue model can generate the dialogue content "Grandma, I'm home from school, what are you doing?" based on this second state information. Optionally, the dialogue model can generate dialogue content based on both the first and second state information. For example, if the first user is the second user's grandmother, the first state information includes emotion information representing the first user's low mood, and the second state information includes scene information representing the second user currently playing soccer, then the dialogue model can generate the dialogue content "Grandma, I'm playing soccer, do you want to come see me?" based on both the first and second state information, to help the first user overcome their low mood. It should be noted that the above dialogue is only an example and does not mean that the dialogue model will necessarily generate this dialogue.
[0070] Optionally, the first terminal device can also incorporate current time, current air quality, current weather, and other real-world information into the first state information for input into the dialogue model. This real-world information can refer to information that the first terminal device can obtain from the network, and the dialogue content generated by the dialogue model can also be related to the real-world information. Implementing this embodiment can further improve the accuracy of the dialogue content.
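Merging real-world information into the first state information can be sketched as a simple enrichment step. The dictionary layout and the key names (`current_time`, `weather`, `air_quality`) are assumptions; the source only lists the kinds of information involved.

```python
from datetime import datetime
from typing import Dict, Optional


def with_real_world_info(
    first_status: Dict[str, str],
    now: datetime,
    weather: Optional[str] = None,
    air_quality: Optional[str] = None,
) -> Dict[str, str]:
    """Return a copy of the first state information enriched with real-world info.

    Values that could not be obtained from the network are simply left out.
    """
    enriched = dict(first_status)
    enriched["current_time"] = now.isoformat(timespec="minutes")
    if weather is not None:
        enriched["weather"] = weather
    if air_quality is not None:
        enriched["air_quality"] = air_quality
    return enriched
```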
[0071] Step 230: Obtain the voiceprint features of the target user, generate dialogue speech based on the voiceprint features and dialogue content, and play the dialogue speech.
[0072] The target user refers to the user corresponding to the dialogue voice output by the first terminal device, that is, the user simulated by the first terminal device in dialogue with the first user. The target user can include the second user, or other relatives or friends who are related to the first user.
[0073] In some embodiments, if the first terminal device obtains the second state information corresponding to the second user, the target user can be the second user; if the first terminal device does not obtain the second state information corresponding to the second user, but only obtains the first state information corresponding to the first user, the target user can be any relative or friend determined from multiple relatives and friends. For example, if the first terminal device obtains the first state information corresponding to the first user, and the emotional information included in the first state information indicates that the first user is in a low mood, and the multiple relatives and friends can include the first user's son, daughter, friends, partner, etc., then any relative or friend can be determined from the multiple relatives and friends as the target user to comfort the first user.
[0074] In other embodiments, if the first terminal device obtains the second state information corresponding to the second user, it can also select other relatives or friends as the target user besides the second user. For example, if the first user is the second user's grandmother, and the scene information included in the second state information represents the scene of the second user returning home from school, then the target user can also be the second user's father, i.e., the first user's son. The dialogue model can generate dialogue content such as "Mom, Xiaobao is home from school, I'll come see you in a bit!" based on the second state information. This dialogue content is also just one example.
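The basic target-user rule from these paragraphs can be sketched as follows: use the second user when second state information was obtained, otherwise pick any relative or friend. The function name and the random choice are assumptions; the source only says that "any relative or friend can be determined", and the variant in which another relative is chosen even though second state information exists is not modeled here.

```python
import random
from typing import Optional, Sequence


def choose_target_user(
    relatives: Sequence[str],
    second_user: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the user whose voice the first terminal device will simulate.

    second_user is set only when second state information was obtained.
    """
    if second_user is not None:
        return second_user
    if not relatives:
        raise ValueError("no relatives or friends are configured")
    chooser = rng if rng is not None else random.Random()
    return chooser.choice(list(relatives))
```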
[0075] The target user's voiceprint features can be stored in the cloud or on the first terminal device. The first terminal device can acquire the target user's voiceprint features and generate dialogue speech based on the voiceprint features and dialogue content using a TTS (Text-to-Speech) model. The first terminal device then plays the dialogue speech. The TTS model can include a recurrent neural network model or a Transformer model.
[0076] As an optional implementation, steps 210-230 above, apart from playing the dialogue voice, can also be performed by a server. The dialogue model can also be deployed on the server. Specifically, the server can obtain user state information and generate dialogue content based on the first state information and / or the second state information through the dialogue model. It can then obtain the target user's voiceprint features and generate dialogue voice based on the voiceprint features and dialogue content. Finally, it sends the dialogue voice to the first terminal device, which plays it. This implementation can reduce the computational load on the first terminal device and speed up the generation of dialogue voice.
[0077] In this embodiment, the first terminal device can generate dialogue content based on user status information through a dialogue model, and generate dialogue voice based on the target user's voiceprint information and the dialogue content. The personalized dialogue content and dialogue voice can provide the first user with a more intimate and comfortable voice dialogue experience. Moreover, the first terminal device can actively initiate voice dialogue based on user status information, thereby enhancing the companionship effect provided by AI voice chat and improving the human-computer interaction of the terminal device.
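Steps 210 to 230 can be tied together in a short sketch. The dialogue model, the TTS model, and the audio player are injected as plain callables because the source leaves their types open; the signatures below are assumptions, and the stubs in the usage note are placeholders rather than real models.

```python
from typing import Callable, Dict

DialogueModel = Callable[[Dict[str, str]], str]  # user status -> dialogue content
TtsModel = Callable[[str, bytes], bytes]         # (content, voiceprint) -> audio


def run_voice_dialogue(
    user_status: Dict[str, str],
    target_user: str,
    voiceprints: Dict[str, bytes],
    dialogue_model: DialogueModel,
    tts_model: TtsModel,
    play: Callable[[bytes], None],
) -> str:
    """Run steps 220 and 230 on status information acquired in step 210."""
    if not user_status:
        raise ValueError("user status information is required (step 210)")
    content = dialogue_model(user_status)    # step 220: generate dialogue content
    voiceprint = voiceprints[target_user]    # step 230: fetch voiceprint features
    audio = tts_model(content, voiceprint)   # step 230: synthesize dialogue voice
    play(audio)                              # step 230: play the dialogue voice
    return content
```

For a dry run, `dialogue_model` can be a function that returns a fixed sentence, `tts_model` a function that concatenates the voiceprint bytes and the encoded text, and `play` the `append` method of a list.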
[0078] Figure 3 is a flowchart illustrating another voice dialogue method disclosed in an embodiment of this application. As shown in Figure 3, this voice dialogue method can be applied to the first terminal device in the above embodiments, and the voice dialogue method may include the following steps:
[0079] Step 310: Obtain user status information.
[0080] Step 320: If the first state information is detected to meet the first preset condition, then the dialogue content is generated by the dialogue model based on the first state information, or by the dialogue model based on the first state information and the second state information.
[0081] The first terminal device can acquire the first state information corresponding to the first user at preset intervals and detect whether the first state information meets the first preset condition. If the first terminal device detects that the first state information meets the first preset condition, the first terminal device can input the first state information or the first state information and the second state information into the dialogue model. The dialogue model can generate dialogue content based on the first state information or the first state information and the second state information.
[0082] Optionally, the first state information may include the first user's first scene information, and the first preset condition may include matching the first scene information with any one of multiple first preset scene information. The first scene information is determined based on data collected by the position sensor and motion sensor of the first terminal device. The first terminal device can collect position data through its position sensor and motion data through its motion sensor, and determine the first scene information based on this data. For example, if the position data collected by the position sensor indicates that the first user is in a park, and the motion data collected by the motion sensor indicates that the first user is moving slowly, then the first terminal device can determine that the first scene information is that the first user is taking a walk in the park. The multiple first preset scene information may be manually preset or preset based on the first user's lifestyle habits. For example, if the first user has a habit of taking walks in the park, in order for the first terminal device to be able to communicate with the first user when the first user is taking a walk in the park, the scene of the first user taking a walk in the park can be saved as a first preset scene information.
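The scene determination and matching described in paragraph [0082] can be sketched as follows. This is a minimal illustration only: the `SensorSample` type, the speed thresholds, and the scene label format are assumptions introduced for the example, since the application does not specify how position data and motion data are combined.

```python
from dataclasses import dataclass

# Hypothetical thresholds (metres per second); the application gives no values.
STATIONARY_MAX_MPS = 0.1
SLOW_MAX_MPS = 1.5


@dataclass(frozen=True)
class SensorSample:
    place: str        # place label resolved from position-sensor data, e.g. "park"
    speed_mps: float  # speed derived from motion-sensor data


def determine_scene(sample: SensorSample) -> str:
    """Combine place and motion into a coarse scene label."""
    if sample.speed_mps <= STATIONARY_MAX_MPS:
        motion = "resting"
    elif sample.speed_mps <= SLOW_MAX_MPS:
        motion = "walking"
    else:
        motion = "moving_fast"
    return f"{motion}_in_{sample.place}"


def meets_first_preset_condition(scene: str, preset_scenes: set[str]) -> bool:
    """The condition holds when the scene matches any one preset scene."""
    return scene in preset_scenes


# Example: a preset scene saved from the first user's habit of walking in the park.
preset_scenes = {"walking_in_park"}
scene = determine_scene(SensorSample(place="park", speed_mps=1.0))
should_start_dialogue = meets_first_preset_condition(scene, preset_scenes)
```

Representing the preset scenes as a set keeps the "matches any one of multiple preset scenes" check to a single membership test.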
[0083] Optionally, the first state information may include the first user's emotional information, and the first preset condition may include the first user's emotional information matching any one of a plurality of preset emotional information. The first terminal device may collect data through multiple biosensors and determine the first user's emotional information based on the data collected by the multiple biosensors; this application does not limit how the emotional information is determined. The plurality of preset emotional information may include, but is not limited to, sadness, happiness, and sorrow.
[0084] Step 330: If the second state information is received, then the dialogue content is generated by the dialogue model based on the second state information, or by the dialogue model based on the first state information and the second state information.
[0085] Similar to the method used by the first terminal device to detect the first state information, the second terminal device can also acquire the second state information corresponding to the second user at preset intervals and detect whether the second state information meets the second preset condition. If the second terminal device detects that the second state information meets the second preset condition, the second terminal device can send that second state information to the first terminal device. When the first terminal device receives the second state information sent by the second terminal device, it determines that the second state information corresponding to the second user meets the second preset condition. The first terminal device can input either the second state information alone, or both the first state information and the second state information, into the dialogue model. The dialogue model can generate dialogue content based on the information it receives.
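The division of work between the two devices in paragraph [0085] might look like the sketch below. The function names, the dictionary payload, and the use of a plain list in place of a network link are assumptions made for illustration; the application does not define a message format or transport.

```python
from typing import Callable, Optional


def second_device_check(
    second_scene: str,
    second_preset_scenes: set[str],
    send: Callable[[dict], None],
) -> bool:
    """Runs on the second terminal device at each preset interval."""
    if second_scene in second_preset_scenes:
        send({"scene": second_scene})  # only qualifying state is transmitted
        return True
    return False


def on_second_state_received(
    second_state: dict, first_state: Optional[dict] = None
) -> dict:
    """Runs on the first terminal device; receipt implies the condition was met."""
    model_input = {"second_state": second_state}
    if first_state is not None:
        model_input["first_state"] = first_state
    return model_input


# Usage: a list stands in for the link between the two devices.
outbox: list[dict] = []
second_device_check("arrived_at_office", {"arrived_at_office"}, outbox.append)
second_device_check("commuting", {"arrived_at_office"}, outbox.append)
model_inputs = [
    on_second_state_received(state, {"scene": "walking_in_park"}) for state in outbox
]
```

Because the second device filters before sending, the first device does not need to re-evaluate the second preset condition; it only decides whether to add its own first state information to the model input.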
[0086] Optionally, the second state information includes the second user's second scene information, which is determined by data collected by the position sensor and motion sensor of the second user's second terminal device. The second preset condition for the second user includes matching the second scene information with any one of multiple second preset scene information. The determination of the second state information and the second preset condition can refer to the determination of the first state information and the first preset condition in step 220, which will not be repeated here. However, it should be noted that since the first user and the second user have different lifestyles, the multiple first preset scene information in the first preset condition and the multiple second preset scene information in the second preset condition can also be different.
[0087] Step 340: Obtain the voiceprint features of the target user, generate dialogue speech based on the voiceprint features and dialogue content, and play the dialogue speech.
[0088] In this embodiment, when the first terminal device detects that the first state information meets the first preset condition, it can generate dialogue content through a dialogue model based on the first state information, or through a dialogue model based on the first state information and the second state information. When the second terminal device detects that the second state information meets the second preset condition, it can send the second state information to the first terminal device. The first state information may include first scene information and / or emotion information, and the second state information may include second scene information. When the first terminal device receives the second state information, it can generate dialogue content through a dialogue model based on the second state information, or through a dialogue model based on the first state information and the second state information.
[0089] In implementing this embodiment, the first terminal device can generate dialogue content based on first scene information and / or emotion information, and possibly combined with second state information. The first terminal device can also generate dialogue content based on the second user's second scene information, and possibly combined with the first state information. Therefore, the dialogue content can be more targeted and personalized. The dialogue content can be optimized according to different scenarios and needs of different users, providing a more customized voice dialogue experience. This is equivalent to simulating a dialogue between the second user and the first user, making AI voice chat more realistic. The first terminal device can replace the second user to accompany the first user, enhancing human-computer interaction.
[0090] As shown in Figure 4, Figure 4 is a flowchart illustrating another voice dialogue method disclosed in an embodiment of this application. This voice dialogue method can be applied to the first terminal device in the above embodiments, and the voice dialogue method may include the following steps:
[0091] Step 410: Obtain user status information.
[0092] Step 420: Based on the first state information and / or the second state information, determine the target user from among multiple relatives and friends of the first user.
[0093] The first terminal device can use a preset algorithm to generate, based on the first state information and / or the second state information, a dialogue index between the first user and each of the first user's relatives and friends. This application does not limit the preset algorithm. The dialogue index characterizes the first user's interest in engaging in dialogue with a given relative or friend; the higher the dialogue index, the greater the interest. The first terminal device can identify the relative or friend with the highest dialogue index as the target user. Optionally, the first terminal device can instead randomly select the target user from the multiple relatives and friends of the first user.
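Selecting the target user from the dialogue index can be sketched as below. The index values and the `select_target_user` helper are hypothetical; how the index itself is computed is left open by the application, so the sketch takes it as given.

```python
import random
from typing import Optional


def select_target_user(
    dialogue_index: dict[str, float],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the contact with the highest dialogue index.

    If `rng` is given, pick uniformly at random instead, which corresponds
    to the optional random selection mentioned in the text.
    """
    if not dialogue_index:
        raise ValueError("the first user has no relatives or friends registered")
    if rng is not None:
        return rng.choice(sorted(dialogue_index))
    return max(dialogue_index, key=dialogue_index.__getitem__)


# Hypothetical index values for three contacts of the first user.
index = {"daughter": 0.82, "grandson": 0.91, "neighbour": 0.40}
target = select_target_user(index)
random_target = select_target_user(index, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the random variant reproducible in tests without touching the global random state.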
[0094] Step 430: Use the first state information and / or the second state information, as well as the user information of the target user, as input information for the dialogue model.
[0095] The target user's information may include the target user's identity information and the relationship between the target user and the first user.
[0096] Step 440: Input the input information into the dialogue model, and obtain the dialogue features of the target user based on the user information through the dialogue model. Based on the dialogue features, and according to the first state information and / or the second state information, generate dialogue content corresponding to the target user.
[0097] The first terminal device can input the input information into the dialogue model. The dialogue model can obtain the dialogue features of the target user based on the identity information in the user information. Optionally, the dialogue features may include, without limitation, words the target user commonly uses in dialogue and the target user's habitual word order (e.g., inversion). Based on the dialogue features, the dialogue model can generate dialogue content corresponding to the target user according to the first state information and / or the second state information.
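Steps 430 and 440 can be pictured with the following sketch, which assembles the input information and resolves dialogue features from the identity information. The `UserInfo` and `DialogueFeatures` types, the `FEATURES_BY_IDENTITY` table, and the identifier "user-002" are all hypothetical; in the application the dialogue model itself obtains the features, and no storage layout is described.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserInfo:
    identity: str      # identity information of the target user
    relationship: str  # relationship to the first user, e.g. "daughter"


@dataclass
class DialogueFeatures:
    common_words: list[str] = field(default_factory=list)
    uses_inversion: bool = False


# Hypothetical store of per-identity dialogue features.
FEATURES_BY_IDENTITY = {
    "user-002": DialogueFeatures(common_words=["Dad", "take care"]),
}


def build_input_info(
    user_info: UserInfo,
    first_state: Optional[dict] = None,
    second_state: Optional[dict] = None,
) -> dict:
    """Step 430: bundle the available state information with the user information."""
    if first_state is None and second_state is None:
        raise ValueError("at least one of first_state / second_state is required")
    info: dict = {"user_info": user_info}
    if first_state is not None:
        info["first_state"] = first_state
    if second_state is not None:
        info["second_state"] = second_state
    return info


def get_dialogue_features(input_info: dict) -> DialogueFeatures:
    """Step 440, first half: resolve features from the identity information."""
    identity = input_info["user_info"].identity
    return FEATURES_BY_IDENTITY.get(identity, DialogueFeatures())


info = build_input_info(
    UserInfo(identity="user-002", relationship="daughter"),
    first_state={"scene": "walking_in_park"},
)
features = get_dialogue_features(info)
```

An unknown identity falls back to empty features, so content generation can still proceed without personalization.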
[0098] Step 450: Obtain the voiceprint features of the target user, generate dialogue speech based on the voiceprint features and dialogue content, and play the dialogue speech.
[0099] In this embodiment, the first terminal device can determine the target user and obtain its dialogue features based on the first state information and / or the second state information. Based on the dialogue features, the device generates personalized dialogue content corresponding to the target user through a dialogue model, thereby providing a voice companionship experience that better meets the needs of the first user.
[0100] As shown in Figure 5, Figure 5 is a flowchart illustrating another voice dialogue method disclosed in an embodiment of this application. This voice dialogue method can be applied to the first terminal device in the above embodiments, and the voice dialogue method may include the following steps:
[0101] Step 510: Obtain user status information.
[0102] Step 520: Based on the first state information and / or the second state information, determine the target user from among multiple relatives and friends of the first user.
[0103] Step 530: Use the first state information and / or the second state information, as well as the target user's user information, as input information for the dialogue model.
[0104] The methods in steps 510 to 530 can refer to the methods in steps 410 to 430 in the above embodiments, and will not be repeated here.
[0105] Step 540: If the input information is the same as the historical input information, then obtain the historical information corresponding to the historical input information.
[0106] The historical information may include historical input information, historical dialogue content corresponding to the historical input information, and historical response content of the first user. The first terminal device can compare the current input information with the historical input information included in each historical information. If the input information is the same as the historical input information, the first terminal device can obtain the historical information corresponding to that historical input information. The historical dialogue content corresponding to the historical input information can be generated by the first terminal device inputting the historical input information into the dialogue model, and the historical response content of the first user can be determined based on the first user's response voice to the historical dialogue content.
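The exact-match lookup of step 540, together with the saving described later in step 590, can be sketched as a small in-memory store. The `HistoryRecord` and `HistoryStore` names are hypothetical, and the sketch assumes the input information is JSON-serializable so that it can be compared through a canonical string key.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryRecord:
    input_info: dict       # historical input information
    dialogue_content: str  # content the dialogue model generated for it
    reply_content: str     # the first user's recognized reply


def _key(input_info: dict) -> str:
    # Canonical JSON so that key order does not affect equality.
    return json.dumps(input_info, sort_keys=True)


class HistoryStore:
    def __init__(self) -> None:
        self._records: dict[str, HistoryRecord] = {}

    def save(self, record: HistoryRecord) -> None:
        """Step 590: keep the latest record for each distinct input."""
        self._records[_key(record.input_info)] = record

    def find(self, input_info: dict) -> Optional[HistoryRecord]:
        """Step 540: return history only when the input is identical."""
        return self._records.get(_key(input_info))


store = HistoryStore()
store.save(
    HistoryRecord(
        input_info={"scene": "walking_in_park", "target": "daughter"},
        dialogue_content="Nice day for a walk, Dad?",
        reply_content="Yes, it is lovely.",
    )
)
hit = store.find({"target": "daughter", "scene": "walking_in_park"})
miss = store.find({"scene": "at_home", "target": "daughter"})
```

Keying on the serialized input makes the "same as the historical input information" comparison a dictionary lookup rather than a scan over every record.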
[0107] Step 550: Input the input information and historical information into the dialogue model, and obtain the dialogue features of the target user based on the user information through the dialogue model. Based on the dialogue features, the dialogue content corresponding to the target user is generated according to the first state information and / or the second state information, as well as the historical information.
[0108] The first terminal device can input input information and historical information into the dialogue model. When generating dialogue content, the dialogue model can also refer to this historical information. Optionally, the dialogue model can determine the first user's historical satisfaction with the historical dialogue content based on the historical responses in the historical information. If the satisfaction level is greater than a satisfaction threshold, the historical dialogue content can be used as positive reference content. Based on dialogue features and according to the first state information and / or the second state information, dialogue content similar to the historical dialogue content can be generated. If the satisfaction level is not greater than the satisfaction threshold, the historical dialogue content can be used as negative reference content. Based on dialogue features and according to the first state information and / or the second state information, dialogue content dissimilar to the historical dialogue content can be generated.
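The threshold rule in paragraph [0108] can be illustrated as follows. The 0-to-1 satisfaction scale, the threshold value, and the wording of the generated guidance are assumptions for the example; the application does not say how satisfaction is scored or how the reference content is passed to the dialogue model.

```python
from typing import Literal

# Hypothetical threshold on a 0..1 satisfaction score.
SATISFACTION_THRESHOLD = 0.6


def classify_reference(
    satisfaction: float, threshold: float = SATISFACTION_THRESHOLD
) -> Literal["positive", "negative"]:
    """Greater than the threshold gives positive; otherwise negative."""
    return "positive" if satisfaction > threshold else "negative"


def reference_instruction(historical_dialogue: str, satisfaction: float) -> str:
    """Turn the classification into guidance passed along with the model input."""
    if classify_reference(satisfaction) == "positive":
        return f"Generate content similar to: {historical_dialogue!r}"
    return f"Generate content dissimilar to: {historical_dialogue!r}"


liked = reference_instruction("Nice day for a walk, Dad?", satisfaction=0.9)
disliked = reference_instruction("Have you taken your pills?", satisfaction=0.2)
```

Note that a score exactly equal to the threshold is treated as negative, matching the "not greater than" wording in the text.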
[0109] Step 560: Obtain the voiceprint features of the target user, generate dialogue speech based on the voiceprint features and dialogue content, and play the dialogue speech.
[0110] Step 570: Detect the first user's reply voice in response to the dialogue voice.
[0111] After the dialogue voice is played, the first terminal device can detect the first user's reply voice in response to the dialogue voice through the audio detection device.
[0112] Step 580: Perform content recognition on the reply voice to determine the reply content corresponding to the reply voice.
[0113] In this embodiment, the first terminal device can use a speech recognition algorithm to identify the content of the reply speech and determine the corresponding reply content. This application does not limit the speech recognition algorithm. In one embodiment, after the first terminal device determines the reply content, it can send the reply content and the dialogue content to the target user's terminal device, so that the target user can understand the dialogue simulated between the target user and the first user, avoiding information gaps between the target user and the first user.
[0114] Step 590: Save the reply content, dialogue content, and input information as historical information.
[0115] In this embodiment, after playing the dialogue voice, the first terminal device can also detect the first user's reply voice and perform content recognition on it to determine the corresponding reply content. The reply content, dialogue content, and input information are then saved as historical information. Each time input information is determined, it can be compared with the historical input information in the historical information. When the two are the same, the corresponding historical information can be obtained and input into the dialogue model together with the input information, so that the dialogue model generates dialogue content corresponding to the target user based on the first state information and / or the second state information, as well as the historical information. Referring to the historical information further improves the accuracy of the dialogue content.
[0116] As shown in Figure 6, Figure 6 is a modular schematic diagram of a voice dialogue device disclosed in an embodiment of this application. The voice dialogue device 600 can be applied to the first terminal device in the above embodiments. The voice dialogue device 600 may include an information acquisition module 610, a content generation module 620, and a voice playback module 630, wherein:
[0117] The information acquisition module is used to acquire user status information, which includes first status information corresponding to the first user of the first terminal device, and / or second status information corresponding to the second user, wherein the second user and the first user are relatives or friends.
[0118] The content generation module is used to generate dialogue content based on the first state information and / or the second state information through the dialogue model;
[0119] The voice playback module is used to acquire the voiceprint features of the target user, generate dialogue voice based on the voiceprint features and dialogue content, and play the dialogue voice; wherein, the target user includes the second user, or other relatives or friends of the first user.
[0120] In one embodiment, the content generation module is further configured to, if it is detected that the first state information meets the first preset condition, generate dialogue content through a dialogue model based on the first state information, or through a dialogue model based on the first state information and the second state information; the first state information includes the first scene information of the first user, and the first preset condition includes: the first scene information matches any one of a plurality of first preset scene information; and / or, the first state information includes the first user's emotion information, and the first preset condition includes: the first user's emotion information matches any one of a plurality of preset emotion information.
[0121] In one embodiment, the content generation module is further configured to, if receiving second state information, generate dialogue content based on the second state information through a dialogue model, or based on the first state information and the second state information through a dialogue model; wherein, the second state information is sent by the second terminal device when it detects that the second state information meets a second preset condition; the second state information includes the second scene information of the second user, and the second preset condition of the second user includes: the second scene information matches any one of a plurality of second preset scene information.
[0122] In one embodiment, the content generation module is further configured to: determine a target user from among multiple relatives and friends of the first user, based on the first state information and / or the second state information; use the first state information and / or the second state information, as well as the user information of the target user, as input information to the dialogue model; input the input information into the dialogue model, obtain the dialogue features of the target user based on the user information through the dialogue model, and generate dialogue content corresponding to the target user based on the dialogue features and according to the first state information and / or the second state information.
[0123] In one embodiment, the content generation module is further configured to: if the input information is the same as the historical input information, obtain the historical information corresponding to the historical input information, the historical information including the historical input information, the historical dialogue content corresponding to the historical input information, and the historical reply content of the first user; input the input information into the dialogue model, obtain the dialogue features of the target user through the dialogue model based on the user information, and generate dialogue content corresponding to the target user based on the dialogue features, according to the first state information and / or the second state information, including: inputting the input information and historical information into the dialogue model, obtaining the dialogue features of the target user through the dialogue model based on the user information, and generating dialogue content corresponding to the target user based on the dialogue features, according to the first state information and / or the second state information, and the historical information.
[0124] In one embodiment, the voice dialogue device further includes an information storage module, which is used to detect the first user's reply voice in response to the dialogue voice; perform content recognition on the reply voice to determine the reply content corresponding to the reply voice; and save the reply content, dialogue content, and input information as historical information.
[0125] In one embodiment, the voice dialogue device further includes an information sending module for sending the reply content and dialogue content to the target user's terminal device.
[0126] In this embodiment, the first terminal device can generate dialogue content based on user status information through a dialogue model, and generate dialogue voice based on the target user's voiceprint information and the dialogue content. The personalized dialogue content and dialogue voice can provide the first user with a more intimate and comfortable voice dialogue experience. Moreover, the first terminal device can actively initiate voice dialogue based on user status information, thereby enhancing the companionship effect provided by AI voice chat and improving the human-computer interaction of the terminal device.
[0127] As shown in Figure 7, in one embodiment, a terminal device is provided, which may include:
[0128] Memory 710 storing executable program code;
[0129] Processor 720 coupled to memory 710;
[0130] The processor 720 can call the executable program code stored in the memory 710 to implement the voice dialogue method provided in the above embodiments.
[0131] The memory 710 may include random access memory (RAM) or read-only memory (ROM). The memory 710 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 710 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as touch functionality, sound playback functionality, image playback functionality, etc.), and instructions for implementing the various method embodiments described above. The data storage area may also store data created by the terminal device during use.
[0132] Processor 720 may include one or more processing cores. Processor 720 connects to various parts of the terminal device using various interfaces and lines, and performs various functions and processes data by running or executing instructions, programs, code sets, or instruction sets stored in memory 710, and by calling data stored in memory 710. Optionally, processor 720 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA). Processor 720 may integrate one or more of the following: a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into processor 720 and may instead be implemented separately using a communication chip.
[0133] It is understood that the terminal device may include more or fewer structural elements than those shown in the above block diagram, such as a power module, physical buttons, a WiFi (Wireless Fidelity) module, a speaker, a Bluetooth module, and sensors, without limitation.
[0134] This application discloses a computer-readable storage medium storing a computer program that causes a computer to perform the methods described in the above embodiments.
[0135] Furthermore, this application further discloses a computer program product that, when run on a computer, enables the computer to execute all or part of the steps in any of the voice dialogue methods described in the above embodiments.
[0136] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
[0137] The above provides a detailed description of a voice dialogue method, apparatus, terminal device, and storage medium disclosed in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A voice dialogue method, characterized in that, Applied to a first terminal device, the method includes: Obtain user status information, which includes first status information corresponding to a first user of the first terminal device and second status information corresponding to a second user, wherein the second user and the first user are relatives or friends. The dialogue model generates dialogue content based on the first state information and the second state information. The voiceprint features of the target user are obtained, and a dialogue voice is generated based on the voiceprint features and the dialogue content, and the dialogue voice is played; wherein, the target user includes the second user, or other relatives or friends who are relatives or friends of the first user; The step of generating dialogue content based on the first state information and the second state information through a dialogue model includes: Based on the first status information and the second status information, the target user is determined from multiple relatives and friends who are related to the first user. The first state information, the second state information, and the user information of the target user are used as input information for the dialogue model. The input information is input into the dialogue model, and the dialogue model obtains the dialogue features of the target user based on the user information. Based on the dialogue features, and according to the first state information and the second state information, dialogue content corresponding to the target user is generated.
2. The method according to claim 1, characterized in that, The step of generating dialogue content based on the first state information and the second state information through a dialogue model includes: If the first state information is detected to meet the first preset condition, the dialogue model generates dialogue content based on the first state information and the second state information. The first status information includes the first scene information of the first user, and the first preset condition includes: the first scene information matches any one of a plurality of first preset scene information; and / or, The first state information includes the first user's emotional information, and the first preset condition includes: the first user's emotional information matches any one of a plurality of preset emotional information.
3. The method according to claim 1, characterized in that, The step of generating dialogue content based on the first state information and the second state information through a dialogue model includes: If the second status information is received, the dialogue model generates dialogue content based on the first status information and the second status information; wherein, the second status information is sent by the second terminal device when it detects that the second status information meets the second preset condition; The second status information includes the second scene information of the second user, and the second preset condition of the second user includes: the second scene information matches any one of a plurality of second preset scene information.
4. The method according to claim 1, characterized in that, After taking the first state information, the second state information, and the target user's user information as input information for the dialogue model, the method further includes: If the input information is the same as the historical input information, then the historical information corresponding to the historical input information is obtained. The historical information includes the historical input information, the historical dialogue content corresponding to the historical input information, and the historical reply content of the first user. The step of inputting the input information into a dialogue model, obtaining the dialogue features of the target user based on the user information through the dialogue model, and generating dialogue content corresponding to the target user based on the dialogue features, the first state information, and the second state information includes: The input information and the historical information are input into the dialogue model, and the dialogue model obtains the dialogue features of the target user based on the user information. Based on the dialogue features, the dialogue content corresponding to the target user is generated according to the first state information, the second state information, and the historical information.
5. The method according to claim 4, characterized in that, After playing the dialogue voice, the method further includes: The first user's reply voice in response to the dialogue voice is detected; The content of the reply voice is recognized to determine the reply content corresponding to the reply voice; The reply content, the dialogue content, and the input information are saved as historical information.
6. The method according to claim 5, characterized in that, After performing content recognition on the reply voice and determining the reply content corresponding to the reply voice, the method further includes: The reply content and the dialogue content are sent to the target user's terminal device.
7. A voice dialogue device, characterized in that, Applied to a first terminal device, the device includes: The information acquisition module is used to acquire user status information, which includes first status information corresponding to a first user of the first terminal device and second status information corresponding to a second user, wherein the second user and the first user are relatives or friends. The content generation module is used to generate dialogue content based on the first state information and the second state information through the dialogue model; The voice playback module is used to acquire the voiceprint features of the target user, generate dialogue voice based on the voiceprint features and the dialogue content, and play the dialogue voice; wherein, the target user includes the second user, or other relatives or friends who are relatives or friends of the first user; The content generation module is specifically configured to: determine a target user from multiple friends and relatives of the first user based on the first state information and the second state information; use the first state information, the second state information, and the user information of the target user as input information for a dialogue model; input the input information into the dialogue model, and obtain the dialogue features of the target user based on the user information through the dialogue model, and generate dialogue content corresponding to the target user based on the dialogue features and the first state information and the second state information.
8. A terminal device, characterized in that, it includes: A memory storing executable program code; A processor coupled to the memory; The processor calls the executable program code stored in the memory to execute the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, wherein, when executed by a processor, the computer program causes the processor to perform the method according to any one of claims 1 to 6.