A voice interaction method and a voice interaction system
The display device recognizes a target user's voice command and sends the corresponding feedback information to an information output device associated with that user, which outputs it in place of the display device. This addresses the interference that feedback shown on a shared display device causes in multi-user scenarios, so that feedback reaches the target user without disturbing other users.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HISENSE VISUAL TECH CO LTD
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-12
AI Technical Summary
Existing approaches that show feedback information in a small window do not completely solve the problem that feedback information displayed in multi-user scenarios interferes with other users.
The display device acquires the target user's voice command, identifies the user, generates feedback information, and sends it to the associated information output device for output, so that the feedback information is not displayed directly on the display device.
This removes the interference that feedback information on the display device causes for other users and provides feedback output dedicated to the target user.
Figure CN122201284A_ABST
Description
Technical Field
[0001] This application belongs to the field of human-computer interaction technology, and in particular relates to a voice interaction method and a voice interaction system.
Background Technology
[0002] Currently, some display devices can display feedback information corresponding to a user's voice command. For example, the display device can be a television, and the user's voice command can be a voice command corresponding to the content displayed on the television (for example, if the television displays a football match, the voice command can be "Give information about the head coach of the red team"). Based on this, the display device can generate and display feedback information corresponding to the user's voice command (for example, displaying an image corresponding to "information about the head coach of the red team" or outputting the voice corresponding to "information about the head coach of the red team").
[0003] In practical applications, there are often situations where multiple users use the same display device at the same time (for example, multiple users watching a TV at the same time). In this case, if one user outputs a voice command to the display device and the display device displays the corresponding feedback information for the voice command, it will interfere with other users.
[0004] Currently, the common method to solve the above problems is to display the feedback information corresponding to the voice command in a small window on the display device. However, this method can only reduce the interference of the feedback information displayed by the display device to other users to a certain extent, and does not completely solve the problem of the feedback information displayed by the display device interfering with other users.
Summary of the Invention
[0005] In view of this, embodiments of this application provide a voice interaction method and a voice interaction system to solve the problem in the prior art where feedback information displayed by a display device interferes with other users.
[0006] In a first aspect, embodiments of this application provide a voice interaction method applied to a display device, the method comprising:
obtaining a target voice command output by a target user who is currently using the display device, the target voice command being output by the target user based on the current display content of the display device;
obtaining the identity information of the target user based on the voice features of the target voice command, and identifying, based on the identity information, the target information output device associated with the target user;
generating first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the current display content, and the device information of the target information output device; and
if the target information output device is online, sending the first feedback information to the target information output device to instruct the target information output device to output the first feedback information to the target user.
[0007] Optionally, obtaining the target voice command output by the target user using the display device includes: acquiring user voice information output by any user currently using the display device; and, for each segment of user voice information, if the type of the segment is a question type or an instruction type and the relevance between the segment and the currently displayed content is greater than a preset relevance threshold, identifying the segment as the target voice command output by the target user.
[0008] Optionally, obtaining user voice information output by any user currently using the display device includes: for any voice information acquired by the display device, matching the voice features of that voice information against pre-stored voice features and, if the match succeeds, identifying that voice information as user voice information. The pre-stored voice features include the voice features corresponding to each user associated with the display device.
[0009] Optionally, obtaining the target voice command output by the target user using the display device further includes: receiving any voice information sent by any information output device associated with the display device; and, if the type of that voice information is a question type or an instruction type and the relevance between that voice information and the currently displayed content is greater than a preset relevance threshold, identifying that voice information as the target voice command output by the target user.
[0010] Optionally, obtaining the target user's identity information based on the voice features of the target voice command includes: obtaining the mapping relationship between the voiceprint information of each user and the identity information of each user; extracting the voiceprint information of the target voice command based on the voice features of the target voice command; and, based on the voiceprint information and the mapping relationship, matching the identity information of the target user corresponding to the target voice command from the candidate identity information.
[0011] Optionally, generating first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the currently displayed content, and the device information of the target information output device includes:
identifying the device type of the target information output device based on the device information of the target information output device, the device type being divided into at least a first type, a second type, and a third type, wherein a target information output device of the first type includes only a voice output module, a target information output device of the second type includes only an image output module, and a target information output device of the third type includes both an image output module and a voice output module;
if the device type is the first type, generating, based on the semantic information and the currently displayed content, first feedback information that includes only voice information;
if the device type is the second type, generating, based on the semantic information and the currently displayed content, first feedback information that includes only image information; and
if the device type is the third type, generating, based on the semantic information and the currently displayed content, first feedback information that includes both voice information and image information.
[0012] Optionally, after generating first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the currently displayed content, and the device information of the target information output device, the method further includes: If the target information output device is offline, a second feedback information including both voice and image information is generated based on the semantic information and the current display content, and the second feedback information is output to the target user through the display device.
[0013] In a second aspect, embodiments of this application provide a voice interaction method applied to an information output device, the method comprising: receiving the first feedback information sent by the display device; and outputting the first feedback information to the target user associated with the information output device. The first feedback information is generated by the display device based on the semantic information of the target voice command, the current display content of the display device, and the device information of the information output device, and is sent to the information output device when the information output device is in an online state; the target voice command is output by the target user based on the current display content of the display device.
[0014] Optionally, the method further includes: After acquiring any voice information output by the target user associated with the information output device, the arbitrary voice information is sent to the display device to instruct the display device to determine whether to recognize the arbitrary voice information as the target voice command.
[0015] In a third aspect, embodiments of this application provide a display device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the voice interaction method according to any implementation of the first aspect.
[0016] In a fourth aspect, embodiments of this application provide an information output device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the voice interaction method according to any implementation of the second aspect.
[0017] In a fifth aspect, embodiments of this application provide a voice interaction system, including a display device as described in the third aspect and an information output device as described in the fourth aspect.
[0018] In a sixth aspect, embodiments of this application provide a computer program product that, when executed by a processor, implements the steps of the voice interaction method according to any implementation of the first aspect, or implements the steps of the voice interaction method according to any implementation of the second aspect.
[0019] The voice interaction method and voice interaction system provided in this application have the following beneficial effects: The voice interaction method provided in this application can be applied to a display device. First, the display device can acquire a target voice command output by a target user currently using the display device. This target voice command is output by the target user based on the current display content of the display device. Then, the display device can acquire the target user's identity information based on the voice characteristics of the target voice command, and identify the target information output device associated with the target user based on the identity information. Next, the display device can generate first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the current display content, and the device information of the target information output device. Finally, if the target information output device is online, the display device can send the first feedback information to the target information output device to instruct it to output the first feedback information to the target user. Through this method, when the target information output device is online, the display device can send the feedback information corresponding to the target voice command to the target information output device associated with the target user, allowing the target information output device to output the feedback information in place of the display device. This eliminates the need for the display device to display feedback information, completely solving the problem of the display device displaying feedback information interfering with other users.
Description of the Drawings
[0020] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a flowchart illustrating the implementation of a voice interaction method provided in an embodiment of this application;
Figure 2 is a flowchart illustrating the implementation of a method for generating feedback information provided in an embodiment of this application;
Figure 3 is a flowchart illustrating the implementation of a method for outputting feedback information provided in an embodiment of this application;
Figure 4 is a flowchart illustrating the implementation of a voice interaction method according to another embodiment of this application;
Figure 5 is a flowchart illustrating the implementation of a method for determining a target voice command provided in an embodiment of this application;
Figure 6 is a schematic diagram of the structure of a voice interaction system provided in an embodiment of this application;
Figure 7 is a schematic diagram of the structure of a display device provided in an embodiment of this application;
Figure 8 is a schematic diagram of the structure of an information output device provided in an embodiment of this application;
Figure 9 is a schematic diagram of the structure of a display device provided in another embodiment of this application;
Figure 10 is a schematic diagram of the structure of an information output device provided in another embodiment of this application.
Detailed Description
[0022] It should be noted that the terminology used in the embodiments of this application is only for explaining specific embodiments of this application and is not intended to limit this application. In the description of the embodiments of this application, unless otherwise stated, "multiple" means two or more, "at least one" or "one or more" means one, two or more. The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
[0023] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0024] The voice interaction method provided in this application can be applied to display devices and information output devices. The display device can include, but is not limited to, electronic devices such as televisions, computers, mobile phones, and projectors. The information output device can be used to output feedback information. Since the feedback information can include image information and voice information, the information output device can be any type of image information output device, voice information output device, or electronic device that can output both image and voice information. For example, the information output device can include, but is not limited to, mobile phones, computers, headphones, and augmented reality glasses (AR glasses).
[0025] When the voice interaction method provided in this application is applied to a display device, the execution subject of the voice interaction method provided in this application is the display device; when the voice interaction method provided in this application is applied to an information output device, the execution subject of the voice interaction method provided in this application is the information output device.
[0026] The voice interaction method provided in this application can be applied to any scenario where a user interacts with a display device. For example, if multiple users are watching the same TV at the same time, in order to avoid the TV displaying feedback information corresponding to the voice command output by a target user from interfering with other users, the various steps of the voice interaction method provided in this application can be executed by the display device and the information output device, thereby fundamentally solving the problem that the display device's display of feedback information will interfere with other users besides the target user.
[0027] Figure 1 is a flowchart illustrating the implementation of a voice interaction method provided in an embodiment of this application. The method can be applied to a display device and may include steps S101 to S104, as detailed below:
In S101, the target voice command output by the target user who is using the display device is obtained; the target voice command is output by the target user based on the current display content of the display device.
[0028] In the embodiments of this application, the display device can acquire the target voice command through the following two possible implementation methods.
[0029] In the first possible implementation, the display device can first obtain user voice information output by any user who is using the display device. Then, for each segment of user voice information, if the type of the user voice information is a question type or an instruction type, and the correlation between the user voice information and the current display content is greater than a preset correlation threshold, then the user voice information is identified as the target voice instruction output by the target user.
[0030] The display device can obtain user voice information output by any user currently using the display device in the following ways: For any voice information acquired by the display device, the voice features of the arbitrary voice information are matched with pre-stored voice features. If the match is successful, the arbitrary voice information is identified as user voice information. The pre-stored voice features include the voice features corresponding to each user associated with the display device.
[0031] In this implementation, the user's voice information can be voice information output by the user associated with the display device. Here, "user-output voice information" refers to voice information directly output by the user; that is, voice information sent by the user to the display device through an information output device cannot be considered user-output voice information.
[0032] The following is an example of the first possible implementation: The display device can pre-store the voice features of each user associated with the display device (users can be associated with the display device by registering on the display device). After the display device obtains any voice information (any voice information can include voice information output by users associated with the display device and voice information output by users not associated with the display device), the display device can match the voice features of the arbitrary voice information obtained by the display device with the pre-stored voice features. If the match is successful, the arbitrary voice information is identified as the user's voice information.
[0033] For example, when user a, associated with the display device, user b, associated with the display device, and user c, not associated with the display device, all use the display device simultaneously, if the voice information obtained by the display device is voice information output by user a, then the display device can identify the voice information as user voice information; if the voice information obtained by the display device is voice information output by user b, then the display device can identify the voice information as user voice information; if the voice information obtained by the display device is voice information output by user c, then the display device may not identify the voice information as user voice information.
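The matching described in [0030] to [0033] can be sketched as follows. This is only a minimal illustration: the specification does not say what form the voice features take or how a match is decided, so the sketch assumes fixed-length numeric feature vectors compared by cosine similarity against a threshold. The names `cosine_similarity`, `is_user_voice`, `MATCH_THRESHOLD`, and the toy vectors are hypothetical.

```python
import math

# Hypothetical threshold; the specification does not define a matching rule.
MATCH_THRESHOLD = 0.9


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_user_voice(features, stored_features, threshold=MATCH_THRESHOLD):
    """Return True if the features match any pre-stored user's features."""
    return any(
        cosine_similarity(features, stored) >= threshold
        for stored in stored_features.values()
    )


# Toy pre-stored features for registered users a and b (assumed values).
STORED = {"user_a": [1.0, 0.0, 0.0], "user_b": [0.0, 1.0, 0.0]}

# Voice resembling user a is accepted as user voice information.
print(is_user_voice([0.98, 0.05, 0.0], STORED))
# Voice from an unregistered speaker (user c) is not.
print(is_user_voice([0.0, 0.1, 0.99], STORED))
```

Under these assumptions, only voice information that matches a registered user proceeds to the question/instruction and relevance checks.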
[0034] After acquiring the user's voice information, the display device can determine, using a preset method, whether the voice information is a question or a command, and whether the relevance between the voice information and the currently displayed content is greater than a preset relevance threshold. If both are true, the display device can identify the voice information as the target voice command and continue executing subsequent steps. If the voice information is neither a question nor a command, or if the relevance between the voice information and the currently displayed content is not greater than the preset relevance threshold, the display device may not identify the voice information as the target voice command and therefore will not execute subsequent steps.
[0035] For example, if the user's voice message is "Give information about the head coach of the Red Team", then the type of the user's voice message can be determined to be an instruction type. If the user's voice message is "Who is the head coach of the Red Team?", then the type of the user's voice message can be determined to be a question type. If the user's voice message is "This stadium is really beautiful", then the type of the user's voice message can be determined to be neither an instruction type nor a question type.
[0036] For example, if the current content displayed on the display device is a football match, and the user's voice message is "Give me the information on the head coach of the red team", then it can be determined that the relevance between the user's voice message and the current display content is greater than a preset relevance threshold. If the user's voice message is "How is the weather today?", then it can be determined that the relevance between the user's voice message and the current display content is not greater than a preset relevance threshold.
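The two checks in [0034] to [0036] can be illustrated with a toy filter. The specification only refers to "a preset method", so everything here is an assumption made for illustration: the type check uses simple word cues, the relevance score is the fraction of utterance words that appear in a keyword set describing the displayed content, and `RELEVANCE_THRESHOLD`, `classify`, `relevance`, and `is_target_command` are hypothetical names. A real system would more likely use trained language models for both steps.

```python
import re

RELEVANCE_THRESHOLD = 0.2  # hypothetical preset relevance threshold

# Toy cues for the type check (assumed, not from the specification).
INSTRUCTION_VERBS = {"give", "show", "tell", "provide", "list"}
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def classify(utterance):
    """Return 'question', 'instruction', or 'other'."""
    words = tokenize(utterance)
    if not words:
        return "other"
    if utterance.strip().endswith("?") or words[0] in QUESTION_WORDS:
        return "question"
    if words[0] in INSTRUCTION_VERBS:
        return "instruction"
    return "other"


def relevance(utterance, content_keywords):
    """Fraction of utterance words found in the displayed-content keywords."""
    words = tokenize(utterance)
    if not words:
        return 0.0
    return sum(1 for w in words if w in content_keywords) / len(words)


def is_target_command(utterance, content_keywords,
                      threshold=RELEVANCE_THRESHOLD):
    """Both conditions must hold: right type AND relevance above threshold."""
    right_type = classify(utterance) in ("question", "instruction")
    return right_type and relevance(utterance, content_keywords) > threshold


# Assumed keywords describing the football match currently on screen.
FOOTBALL = {"football", "match", "team", "coach", "red", "stadium", "goal"}

print(is_target_command(
    "Give information about the head coach of the red team", FOOTBALL))
print(is_target_command("How is the weather today?", FOOTBALL))
print(is_target_command("This stadium is really beautiful", FOOTBALL))
```

With these toy cues, the coach request passes both checks, the weather question fails the relevance check, and the remark about the stadium fails the type check.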
[0037] In the second possible implementation, the display device can receive any voice information sent by any information output device associated with the display device. If the type of that voice information is a question type or an instruction type, and the relevance between that voice information and the current display content is greater than a preset relevance threshold, then that voice information is identified as the target voice command output by the target user.
[0038] In the second implementation, the way to determine whether the type of any voice information is a question type or an instruction type, and the way to determine whether the relevance between that voice information and the currently displayed content is greater than the preset relevance threshold, are the same as described for the first implementation and are not repeated here.
[0039] The difference between the first and second implementation methods is as follows: In the first implementation method, the user directly outputs voice information to the display device (that is, there is no need to use an information output device to send the user's voice information to the display device). Therefore, the display device obtains the target voice command from the voice information directly output by the user. In the second implementation method, the user first outputs voice information to the information output device, and then the information output device sends the voice information to the display device. Therefore, the display device obtains the target voice command from the voice information sent by the information output device.
[0040] Based on the above differences, it can be understood that the first implementation method can be applied to scenarios where the information output device does not have the function of acquiring voice information, while the second implementation method can be applied to scenarios where the information output device has the function of acquiring voice information.
[0041] In S102, the identity information of the target user is obtained based on the voice characteristics of the target voice command, and the target information output device associated with the target user is identified based on the identity information.
[0042] In this embodiment of the application, after the display device obtains the target voice command, it can first obtain the identity information of the target user who outputs the target voice command based on the voice characteristics of the target voice command.
[0043] In one possible implementation, the display device can first obtain a mapping relationship describing the connection between each user's voiceprint information and their identity information. Specifically, any user who needs to use the display device can pre-input their voiceprint information and identity information into the display device, so that the display device can obtain and store the mapping relationship describing the connection between each user's voiceprint information and their identity information.
[0044] After obtaining the mapping relationship between the voiceprint information of each user and the identity information of each user, the display device can extract the voiceprint information of the target voice command based on the voice characteristics of the target voice command.
[0045] After extracting the voiceprint information of the target voice command, the display device can match the target user's identity information from a pool of candidate identity information based on the voiceprint information and the mapping relationship between the voiceprint information and the user's identity information. The candidate identity information can be various identity information pre-stored by the display device.
[0046] After the identity information is determined, the display device can identify the target information output device associated with the target user based on the identity information.
[0047] In one possible implementation, the display device can first obtain a mapping relationship describing the connection between identity information and information output devices. Specifically, any user who needs to use the display device can pre-input their identity information and the corresponding information output device into the display device, so that the display device can obtain and store the mapping relationship describing the connection between identity information and information output devices.
[0048] After obtaining the mapping relationship describing the connection between identity information and information output device, the display device can identify the target information output device associated with the target user based on the identity information and the mapping relationship describing the connection between identity information and information output device.
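The two lookups in S102 (voiceprint to identity, then identity to information output device) can be sketched as below. This is a hedged illustration rather than the method of the application: it assumes voiceprints are numeric vectors, that matching picks the nearest stored voiceprint within a maximum Euclidean distance, and that both mapping relationships are simple in-memory tables. `VOICEPRINT_TO_IDENTITY`, `IDENTITY_TO_DEVICE`, `MAX_DISTANCE`, and the device names are hypothetical.

```python
import math

# Hypothetical registration data entered in advance by each user.
VOICEPRINT_TO_IDENTITY = [
    ([0.9, 0.1, 0.2], "user_a"),
    ([0.1, 0.8, 0.3], "user_b"),
]
IDENTITY_TO_DEVICE = {
    "user_a": "user_a-headphones",
    "user_b": "user_b-phone",
}

MAX_DISTANCE = 0.5  # assumed acceptance radius for a voiceprint match


def identify_user(voiceprint):
    """Return the identity with the nearest stored voiceprint, or None."""
    best_identity, best_distance = None, float("inf")
    for stored, identity in VOICEPRINT_TO_IDENTITY:
        distance = math.dist(voiceprint, stored)
        if distance < best_distance:
            best_identity, best_distance = identity, distance
    return best_identity if best_distance <= MAX_DISTANCE else None


def find_target_device(voiceprint):
    """Resolve voiceprint -> identity -> associated output device."""
    identity = identify_user(voiceprint)
    if identity is None:
        return None
    return IDENTITY_TO_DEVICE.get(identity)


print(find_target_device([0.85, 0.15, 0.2]))  # nearest to user_a
print(find_target_device([5.0, 5.0, 5.0]))    # no stored voiceprint is close
```

Keeping the two mappings separate mirrors the text: a user registers a voiceprint and identity once, and can independently register which device should receive their feedback.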
[0049] In S103, first feedback information corresponding to the target voice command is generated based on the semantic information of the target voice command, the current display content, and the device information of the target information output device.
[0050] In this embodiment of the application, after determining the target information output device associated with the target user, the display device can generate the first feedback information corresponding to the target voice command using the method shown in Figure 2.
[0051] Figure 2 is a flowchart of a method for generating feedback information provided in an embodiment of this application. The method may include steps S201 to S204, as detailed below:
In S201, the device type of the target information output device is identified based on the device information of the target information output device.
[0052] In this implementation, the device types are at least divided into a first type, a second type, and a third type. The first type of target information output device includes only a voice output module, the second type of target information output device includes only an image output module, and the third type of target information output device includes both an image output module and a voice output module. For example, the first type of target information output device includes, but is not limited to, a sound player, such as headphones or speakers; the second type of target information output device includes, but is not limited to, an image display that does not include a voice output module; and the third type of target information output device includes, but is not limited to, mobile phones, computers, and AR glasses.
[0053] Specifically, users can pre-input the device type of each information output device into the display device, so that the display device can determine the device type of the target information output device.
[0054] In S202, if the device type is the first type, then based on the semantic information and the currently displayed content, first feedback information including only voice information is generated.
[0055] In this implementation, if the device type is the first type, that is, the target information output device only includes a voice output module, the display device can generate first feedback information that only includes voice information based on semantic information and the current display content.
[0056] For example, if the semantic information of the target voice command is "provide information on the head coach of the red team", and the currently displayed content is a football match, the display device can generate only the voice information describing the information on the head coach of the red team in the currently displayed football match as the first feedback information.
[0057] In S203, if the device type is the second type, then based on the semantic information and the currently displayed content, first feedback information including only image information is generated.
[0058] In this implementation, if the device type is the second type, that is, the target information output device only includes an image output module, the display device can generate first feedback information that only includes image information based on semantic information and the current display content.
[0059] For example, if the semantic information of the target voice command is "give information about the head coach of the red team", and the current displayed content is a football match, the display device can generate only image information describing the information about the head coach of the red team in the currently displayed football match as the first feedback information.
[0060] In S204, if the device type is the third type, then based on the semantic information and the currently displayed content, first feedback information including both voice information and image information is generated.
[0061] In this implementation, if the device type is the third type, that is, the target information output device includes both an image output module and a voice output module, the display device can generate first feedback information that includes both voice information and image information based on semantic information and the current display content.
[0062] For example, if the semantic information of the target voice command is "provide information on the head coach of the red team", and the currently displayed content is a football match, the display device can generate voice information and image information describing the information on the head coach of the red team in the currently displayed football match as the first feedback information.
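The branching in S202 to S204 can be illustrated with a minimal sketch. It is not the implementation of this application: the names `DeviceType`, `Feedback`, and `generate_first_feedback` are hypothetical, and the generated content is a placeholder string standing in for real speech synthesis and image rendering.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeviceType(Enum):
    """Hypothetical labels for the three device types described above."""
    VOICE_ONLY = 1       # first type: voice output module only
    IMAGE_ONLY = 2       # second type: image output module only
    VOICE_AND_IMAGE = 3  # third type: both modules


@dataclass
class Feedback:
    voice: Optional[str] = None  # None means no voice information is included
    image: Optional[str] = None  # None means no image information is included


def generate_first_feedback(semantic_info: str,
                            displayed_content: str,
                            device_type: DeviceType) -> Feedback:
    """Build feedback whose modalities match what the device can output."""
    # Placeholder answer; a real system would derive it from the
    # semantic information and the currently displayed content.
    answer = f"{semantic_info} (context: {displayed_content})"
    wants_voice = device_type in (DeviceType.VOICE_ONLY,
                                  DeviceType.VOICE_AND_IMAGE)
    wants_image = device_type in (DeviceType.IMAGE_ONLY,
                                  DeviceType.VOICE_AND_IMAGE)
    return Feedback(
        voice=f"speech:{answer}" if wants_voice else None,
        image=f"image:{answer}" if wants_image else None,
    )
```

Under these assumptions, a first-type device receives feedback with only the `voice` field set, a second-type device only the `image` field, and a third-type device both.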
[0063] It should be noted that, in one possible implementation, after determining the target information output device, the display device can first determine the current state of the target information output device. The current state of the target information output device can include an online state and an offline state. If the current state of the target information output device is online, the display device can execute S103, that is, the display device can generate the first feedback information corresponding to the target voice command through S103. If the current state of the target information output device is offline, the display device can choose not to execute S103, that is, the display device can choose not to generate the first feedback information corresponding to the target voice command.
[0064] In S104, if the target information output device is online, the first feedback information is sent to the target information output device to instruct the target information output device to output the first feedback information to the target user.
[0065] In this embodiment of the application, if the display device determines that the current state of the target information output device is online, the display device can send the first feedback information to the target information output device to instruct the target information output device to output the first feedback information to the target user.
[0066] Specifically, if the target information output device is a first type of information output device, the target information output device can output first feedback information to the target user that only includes voice information; if the target information output device is a second type of information output device, the target information output device can output first feedback information to the target user that only includes image information; if the target information output device is a third type of information output device, the target information output device can output first feedback information to the target user that includes both voice information and image information.
[0067] In practical applications, the output of feedback information by the target information output device associated with the target user usually does not interfere with other users.
[0068] As an example and not a limitation, the target information output device may include AR glasses and a bone conduction speaker. Based on this, the image information in the first feedback information can be output to the target user through the AR glasses, and the voice information in the first feedback information can be output to the target user through the bone conduction speaker. It can be seen that since the image information displayed by the AR glasses can only be received by the corresponding target user, and the voice information output by the bone conduction speaker can only be received by the corresponding target user, it will not cause any interference to other users.
[0069] As an example and not a limitation, the target information output device may also include a mobile phone and a headset. Based on this, the image information in the first feedback information can be output to the target user through the mobile phone, and the voice information in the first feedback information can be output to the target user through the headset. It can be seen that since the image information displayed by the mobile phone can only be received by the corresponding target user, and the voice information output by the headset can only be received by the corresponding target user, it will not cause any interference to other users.
[0070] In one possible implementation, if the target information output device is currently offline, the display device can perform the steps of the method shown in Figure 3.
[0071] Please refer to Figure 3, which is a flowchart of a method for outputting feedback information provided in an embodiment of this application. The method may include step S301, as detailed below: In S301, if the target information output device is offline, second feedback information including both voice information and image information is generated based on the semantic information and the current display content, and the second feedback information is output to the target user through the display device.
[0072] In this implementation, the display device can include both an image output module and a voice output module. Based on this, after determining the target information output device, the display device can first determine the current state of the target information output device. If the current state of the target information output device is offline, the display device can generate second feedback information that includes both voice information and image information based on semantic information and the current display content.
[0073] For example, if the semantic information of the target voice command is "provide information on the head coach of the Red Team", and the currently displayed content is a football match, the display device can generate voice information and image information describing the information on the head coach of the Red Team in the currently displayed football match, which together serve as the second feedback information.
[0074] In one possible implementation, after determining that the current state of the target information output device is offline, the display device can determine whether the previously generated first feedback information includes both voice information and image information. If the display device determines that the generated first feedback information includes both voice information and image information, then the first feedback information including both voice information and image information can be determined as the second feedback information.
[0075] Subsequently, the display device can output the image information from the second feedback information through the image output module and the voice information from the second feedback information through the voice output module. Optionally, to minimize interference with other users, the display device can display the image information from the second feedback information in a small window.
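The choice between S104 (online) and S301 (offline) can be sketched as follows. This is only an illustration under stated assumptions: `OutputDevice`, `Display`, and `route_feedback` are hypothetical names, feedback is represented as a plain string, and the optional small-window display is modeled as a flag.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OutputDevice:
    """Hypothetical stand-in for a target information output device."""
    name: str
    online: bool
    received: List[str] = field(default_factory=list)

    def output(self, feedback: str) -> None:
        self.received.append(feedback)


@dataclass
class Display:
    """Hypothetical stand-in for the display device's own output modules."""
    shown: List[Tuple[str, str]] = field(default_factory=list)

    def output(self, feedback: str, small_window: bool = True) -> None:
        mode = "small_window" if small_window else "full"
        self.shown.append((mode, feedback))


def route_feedback(display: Display, device: OutputDevice,
                   first_feedback: str, second_feedback: str) -> str:
    """Send feedback to the device if online, else fall back to the display."""
    if device.online:
        # S104: the associated device outputs the first feedback information.
        device.output(first_feedback)
        return "device"
    # S301: the display outputs the second feedback information itself,
    # here in a small window to limit interference with other users.
    display.output(second_feedback, small_window=True)
    return "display"
```

In this sketch the display device shows nothing when the associated device is online, which matches the goal of keeping feedback away from other users.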
[0076] In this embodiment of the application, the display device can also determine the current state of the target information output device at a preset frequency, and automatically connect to the target information output device in the online state when the state of the target information output device is determined to be online, thereby avoiding outputting feedback information to the target user through the display device as much as possible, so as to minimize interference to other users.
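Checking the device state at a preset frequency can be sketched as a bounded polling loop. The function name `wait_until_online`, the `is_online` callback, and the `max_checks` bound are assumptions introduced for illustration; this application does not specify how the state is queried or how long polling continues.

```python
import time
from typing import Callable


def wait_until_online(is_online: Callable[[], bool],
                      interval_s: float,
                      max_checks: int,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the device state every interval_s seconds, up to max_checks times.

    Returns True as soon as the device reports online, otherwise False
    after the last check.
    """
    for attempt in range(max_checks):
        if is_online():
            return True
        # Do not sleep after the final unsuccessful check.
        if attempt < max_checks - 1:
            sleep(interval_s)
    return False
```

The `sleep` parameter is injectable so the loop can be exercised without real delays; once the function returns True, the display device could reconnect and resume sending feedback to the device.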
[0077] In one possible implementation, if the display device, after acquiring the target voice command, cannot determine the target user corresponding to the target voice command, and / or cannot determine the target information output device associated with the target user, it can generate second feedback information and output the second feedback information to the target user to ensure that the target user can receive the feedback information; or, if the display device, after acquiring the target voice command, cannot determine the target user corresponding to the target voice command, and / or cannot determine the target information output device associated with the target user, the display device may not output feedback information to the target user to avoid interference to other users.
[0078] In this embodiment of the application, if the display device acquires multiple target voice commands, the display device executes the steps of the voice interaction method shown in Figures 1 to 3 for each acquired target voice command, so that the display device can interact with multiple different target users simultaneously via voice.
[0079] In one possible implementation, multiple voice commands may overlap, that is, two or more target users may output voice commands to the display device at the same time. For example, one target user outputs the voice command "Who is the head coach of the Red Team?" while another target user outputs the voice command "Who is the goalkeeper of the Red Team?". In this case, the display device acquires the voice command resulting from the overlap of the two voice commands, for example "Who is the head coach of the Red Team? Who is the goalkeeper of the Red Team?". The display device can first determine the two voice features corresponding to the overlapped voice command and perform command separation processing on the overlapped voice command to obtain the voice command corresponding to each of the two voice features. For example, based on the two voice features corresponding to the overlapped voice command, the overlapped voice command "Who is the head coach of the Red Team? Who is the goalkeeper of the Red Team?" can be separated into two different voice commands: "Who is the head coach of the Red Team?" and "Who is the goalkeeper of the Red Team?". Then, for each separated voice command, the display device can execute the steps of the voice interaction method shown in Figure 1, so that the display device can interact with multiple different target users simultaneously via voice.
[0080] As can be seen from the above, the voice interaction method provided in this application can be applied to a display device. First, the display device can obtain the target voice command output by the target user currently using the display device. The target voice command is output by the target user based on the current display content of the display device. Then, the display device can determine the target user's identity information based on the voice characteristics of the target voice command, and determine the target information output device associated with the target user based on the identity information. Next, the display device can generate first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the current display content, and the device information of the target information output device. Finally, if the target information output device is online, the display device can send the first feedback information to the target information output device to instruct the target information output device to output the first feedback information to the target user. Through this method, the display device can send the feedback information corresponding to the target voice command to the target information output device associated with the target user corresponding to the target voice command, so that the target information output device can output the feedback information in place of the display device. This eliminates the need for the display device to display feedback information, completely solving the problem of the display device displaying feedback information interfering with other users.
[0081] The above provides an embodiment of a voice interaction method applied to a display device. The following provides an embodiment of a voice interaction method applied to a target information output device.
[0082] Please refer to Figure 4, which is a flowchart illustrating the implementation of a voice interaction method according to another embodiment of this application. The voice interaction method provided in this embodiment can be applied to an information output device. The voice interaction method may include steps S401 to S402, which are detailed below: In S401, the first feedback information sent by the display device is received.
[0083] In this embodiment, the first feedback information is generated by the display device based on the semantic information of the target voice command, the current display content of the display device, and the device information of the information output device, and is sent to the information output device when the information output device is in an online state; the target voice command is output by the target user based on the current display content of the display device.
[0084] For details on how the first feedback information is generated, refer to the embodiments corresponding to Figures 1 to 3, which are not described in detail here.
[0085] In S402, the first feedback information is output to the target user associated with the information output device.
[0086] In this embodiment, if the information output device is a first type of information output device, the information output device can output first feedback information to the target user through a voice output module; if the information output device is a second type of information output device, the information output device can output first feedback information to the target user through an image output module; if the information output device is a third type of information output device, the information output device can output first feedback information to the target user simultaneously through both a voice output module and an image output module.
[0087] Optionally, the image display module may include AR glasses, and the voice output module may include a bone conduction speaker. Based on this, the target information output device can output image feedback information to the target user through the AR glasses and voice feedback information to the target user through the bone conduction speaker.
[0088] Optionally, the image display module may include a mobile phone, and the voice output module may include headphones. Based on this, the target information output device can output image feedback information to the target user through the mobile phone and output voice feedback information to the target user through the headphones.
[0089] In one possible implementation, the target information output device can also perform the steps corresponding to Figure 5. Please refer to Figure 5, which is a flowchart illustrating the implementation of a method for determining a target voice command according to an embodiment of this application. The method for determining a target voice command provided in this embodiment can be applied to a target information output device. The method for determining a target voice command may include step S501, which is detailed below: In S501, after obtaining arbitrary voice information output by the target user associated with the information output device, the arbitrary voice information is sent to the display device to instruct the display device to determine whether to recognize the arbitrary voice information as the target voice command.
[0090] In this embodiment of the application, if the target information output device has a voice information acquisition function, the target information output device can acquire arbitrary voice information output by the target user and send that arbitrary voice information to the display device to instruct the display device to determine whether to identify the arbitrary voice information as a target voice command. Specifically, the display device can make this determination as follows: if the display device determines that the type of the arbitrary voice information is a question type or a command type, and determines that the correlation between the arbitrary voice information and the current display content is greater than a preset correlation threshold, then the display device can identify the arbitrary voice information as a target voice command.
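The two conditions above (question or command type, and correlation above a preset threshold) can be sketched as follows. This application does not specify how the type or the correlation is computed, so the keyword lists, the keyword-overlap score, and the names `utterance_type`, `relevance`, and `is_target_voice_command` are all assumptions chosen only to make the decision rule concrete.

```python
import re
from typing import List, Set

# Hypothetical word lists; a real system would use a trained classifier.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}
COMMAND_WORDS = {"give", "show", "tell", "provide", "list"}


def words_of(text: str) -> List[str]:
    """Lowercase alphanumeric words of the utterance, in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def utterance_type(text: str) -> str:
    """Classify the utterance as 'question', 'command', or 'other'."""
    words = words_of(text)
    if not words:
        return "other"
    if text.strip().endswith("?") or words[0] in QUESTION_WORDS:
        return "question"
    if words[0] in COMMAND_WORDS:
        return "command"
    return "other"


def relevance(text: str, content_keywords: Set[str]) -> float:
    """Fraction of displayed-content keywords mentioned in the utterance."""
    if not content_keywords:
        return 0.0
    return len(set(words_of(text)) & content_keywords) / len(content_keywords)


def is_target_voice_command(text: str, content_keywords: Set[str],
                            threshold: float = 0.3) -> bool:
    """Apply both conditions: suitable type AND relevance above threshold."""
    return (utterance_type(text) in ("question", "command")
            and relevance(text, content_keywords) > threshold)
```

With keywords describing a football match, a question about the red team's coach passes both conditions, while an unrelated question or a plain remark about the match does not.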
[0091] It is understandable that if the target information output device does not have the function of acquiring voice information, then the target information output device may not execute the steps of S501.
[0092] The following describes two different embodiments with reference to Figures 1 to 5.
[0093] In the first embodiment, the target user directly outputs voice information to the display device (i.e., there is no need to use an information output device to send the user's voice information to the display device). The display device determines the target voice command based on the voice information, then obtains the target user's identity information based on the voice characteristics of the target voice command, and identifies the target information output device associated with the target user based on the identity information. Then, based on the semantic information of the target voice command, the current display content, and the device information of the target information output device, first feedback information corresponding to the target voice command is generated. Finally, if the target information output device is online, the first feedback information is sent to the target information output device to instruct the target information output device to output the first feedback information to the target user. If the target information output device is offline, second feedback information including both voice information and image information is generated based on the semantic information and the current display content, and the second feedback information is output to the target user through the display device.
[0094] In the second embodiment, the target user outputs arbitrary voice information to the corresponding target information output device. The target information output device sends the arbitrary voice information to the display device. The display device determines the target voice command based on the arbitrary voice information sent by the target information output device. Then, based on the voice characteristics of the target voice command, the target user's identity information is obtained, and based on the identity information, the target information output device associated with the target user is identified. Then, based on the semantic information of the target voice command, the current display content, and the device information of the target information output device, first feedback information corresponding to the target voice command is generated. Finally, if the target information output device is online, the first feedback information is sent to the target information output device to instruct the target information output device to output the first feedback information to the target user. If the target information output device is offline, second feedback information including both voice information and image information is generated based on the semantic information and the current display content, and the second feedback information is output to the target user through the display device.
[0095] It is understood that the first embodiment can be applied to scenarios where the target information output device does not have the function of acquiring voice information, while the second embodiment can be applied to scenarios where the target information output device has the function of acquiring voice information.
[0096] The following further provides a voice interaction system. Please refer to Figure 6, which is a schematic diagram of the structure of a voice interaction system provided in an embodiment of this application.
[0097] As shown in Figure 6, the voice interaction system may include a display device and an information output device, wherein the display device and the information output device are communicatively connected, and the information output device may include an image display module and / or a voice output module.
[0098] For the working principle of the voice interaction system shown in Figure 6, refer to the embodiments corresponding to Figures 1 to 3 and the embodiments corresponding to Figure 4 and Figure 5, which are not described in detail here.
[0099] Through the above voice interaction system, the display device can send the feedback information corresponding to the target voice command to the target information output device associated with the target user corresponding to the target voice command. This allows the target user to obtain the corresponding feedback information from the associated target information output device. Therefore, the display device does not need to display any feedback information, which completely solves the problem that displaying feedback information will interfere with other users.
[0100] Based on the voice interaction method provided in the above embodiments, this application further provides a display device and an information output device for implementing the above method embodiments.
[0101] Please refer to Figure 7, which is a schematic diagram of the structure of a display device provided in an embodiment of this application. As shown in Figure 7, the display device 70 may include: a voice command acquisition unit 71, a first information determination unit 72, a second information determination unit 73, and a first information output unit 74. Wherein: the voice command acquisition unit 71 is used to acquire the target voice command output by the target user who is using the display device; the target voice command is output by the target user based on the current display content of the display device.
[0102] The first information determination unit 72 is used to obtain the identity information of the target user based on the voice characteristics of the target voice command, and to identify the target information output device associated with the target user based on the identity information.
[0103] The second information determination unit 73 is used to generate first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the current display content, and the device information of the target information output device.
[0104] The first information output unit 74 is used to send first feedback information to the target information output device if the target information output device is online, so as to instruct the target information output device to output the first feedback information to the target user.
[0105] Optionally, the voice command acquisition unit 71 is specifically used to: obtain user voice information output by any user currently using the display device; and, for each segment of user voice information, if the type of the user voice information is a question or an instruction, and the relevance between the user voice information and the currently displayed content is greater than a preset relevance threshold, identify the user voice information as the target voice command output by the target user.
[0106] Optionally, the voice command acquisition unit 71 is specifically used to: for arbitrary voice information acquired by the display device, match the voice features of the arbitrary voice information against pre-stored voice features; and, if the match is successful, identify the arbitrary voice information as user voice information. The pre-stored voice features include the voice features corresponding to each user associated with the display device.
[0107] Optionally, the voice command acquisition unit 71 is further used to: receive arbitrary voice information sent by any information output device associated with the display device; and, if the type of the arbitrary voice information is a question or an instruction, and the correlation between the arbitrary voice information and the currently displayed content is greater than a preset correlation threshold, identify the arbitrary voice information as the target voice command output by the target user.
[0108] Optionally, the first information determination unit 72 is specifically used to: obtain the mapping relationship between the voiceprint information of each user and the identity information of each user; extract the voiceprint information of the target voice command based on the voice characteristics of the target voice command; and, based on the voiceprint information and the mapping relationship, match the identity information of the target user corresponding to the target voice command from the candidate identity information.
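The matching of voiceprint information against the stored mapping can be sketched as a nearest-match lookup. This application does not state how voiceprints are represented or compared; the sketch below assumes each voiceprint is a numeric feature vector and uses cosine similarity with a hypothetical minimum threshold. The names `cosine_similarity`, `match_identity`, and `min_similarity` are illustrative only.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_identity(voiceprint: Sequence[float],
                   enrolled: Dict[str, Sequence[float]],
                   min_similarity: float = 0.8) -> Optional[str]:
    """Return the enrolled identity most similar to the voiceprint.

    Returns None when no enrolled voiceprint reaches min_similarity,
    which corresponds to the case where the target user cannot be determined.
    """
    best_id: Optional[str] = None
    best_score = min_similarity
    for identity, reference in enrolled.items():
        score = cosine_similarity(voiceprint, reference)
        if score >= best_score:
            best_id, best_score = identity, score
    return best_id
```

Returning `None` for an unmatched voiceprint leaves the caller free to apply either fallback described in the method embodiments: output through the display device, or output nothing.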
[0109] Optionally, the second information determination unit 73 is specifically used to: identify the device type of the target information output device based on the device information of the target information output device, where the device type is divided into at least a first type, a second type, and a third type, a target information output device of the first type includes only a voice output module, a target information output device of the second type includes only an image output module, and a target information output device of the third type includes both an image output module and a voice output module; if the device type is the first type, generate first feedback information that includes only voice information based on the semantic information and the currently displayed content; if the device type is the second type, generate first feedback information that includes only image information based on the semantic information and the currently displayed content; and, if the device type is the third type, generate first feedback information that includes both voice information and image information based on the semantic information and the currently displayed content.
[0110] Optionally, the display device 70 may further include a second information output unit, wherein the second information output unit is specifically used to: if the target information output device is offline, generate second feedback information that includes both voice information and image information based on the semantic information and the current display content, and output the second feedback information to the target user through the display device.
[0111] Please refer to Figure 8, which is a schematic diagram of the structure of an information output device provided in an embodiment of this application. As shown in Figure 8, the information output device 80 may include: an information receiving unit 81 and a third information output unit 82. Wherein: the information receiving unit 81 is used to receive the first feedback information sent by the display device.
[0112] The third information output unit 82 is used to output first feedback information to the target user associated with the information output device; The first feedback information is generated by the display device based on the semantic information of the target voice command, the current display content of the display device, and the device information of the information output device, and is sent to the information output device when the information output device is in an online state; the target voice command is output by the target user based on the current display content of the display device.
[0113] Optionally, the information output device 80 may further include a voice information sending unit, wherein the voice information sending unit is used to send arbitrary voice information to the display device after acquiring the arbitrary voice information output by the target user associated with the information output device, so as to instruct the display device to determine whether to recognize the arbitrary voice information as the target voice command.
[0114] It should be noted that the information interaction and execution processes between the above-mentioned units are based on the same concept as the method embodiments of this application. For their specific functions and technical effects, refer to the method embodiments; they are not repeated here.
[0115] Please refer to Figure 9, which is a schematic diagram of the structure of a display device provided in another embodiment of this application. As shown in Figure 9, the display device 9 provided in this embodiment may include: a processor 90, a memory 91, and a computer program 92 stored in the memory 91 and executable on the processor 90, such as a program corresponding to a voice interaction method. When the processor 90 executes the computer program 92, it implements the steps of the voice interaction method embodiments applied to the display device, for example S101 to S104 shown in Figure 1, S201 to S204 shown in Figure 2, and S301 shown in Figure 3. Alternatively, when the processor 90 executes the computer program 92, it implements the functions of each module / unit in the above-described display device embodiment, for example the functions of units 71 to 74 shown in Figure 7.
[0116] For example, computer program 92 can be divided into one or more modules / units, one or more of which are stored in memory 91 and executed by processor 90 to complete this application. One or more modules / units can be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of computer program 92 in display device 9. For example, computer program 92 can be divided into a voice command acquisition unit 71, a first information determination unit 72, a second information determination unit 73, and a first information output unit 74. For the specific functions of each unit, refer to the relevant descriptions in the embodiment corresponding to Figure 7, which are not repeated here.
[0117] Those skilled in the art will understand that Figure 9 is merely an example of the display device 9 and does not constitute a limitation on the display device 9, which may include more or fewer components than shown, combine certain components, or use different components.
[0118] The processor 90 can be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0119] The memory 91 can be an internal storage unit of the display device 9, such as a hard disk or RAM of the display device 9. The memory 91 can also be an external storage device of the display device 9, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the display device 9. Furthermore, the memory 91 can include both internal and external storage units of the display device 9. The memory 91 is used to store computer programs and other programs and data required by the display device. The memory 91 can also be used to temporarily store data that has been output or will be output.
[0120] Please see Figure 10 , Figure 10 This is a schematic diagram of the structure of an information output device provided in another embodiment of this application. For example... Figure 10As shown, the information output device 10 provided in this embodiment may include: a processor 100, a memory 101, and a computer program 102 stored in the memory 101 and executable on the processor 100, such as a program corresponding to a voice interaction method. When the processor 100 executes the computer program 102, it implements the steps described above in the management method embodiment applied to the display device, for example... Figure 4 S401~S402 and shown Figure 5 S501 is shown. Alternatively, when the processor 100 executes the computer program 102, it implements the functions of each module / unit in the above-described information output device embodiment, for example... Figure 8 The function of unit 81 shown.
[0121] For example, the computer program 102 can be divided into one or more modules/units, which are stored in the memory 101 and executed by the processor 100 to complete this application. The one or more modules/units can be a series of computer program instruction segments capable of performing specific functions, and these instruction segments describe the execution process of the computer program 102 in the information output device 10. For example, the computer program 102 can be divided into an information receiving unit 81 and a third information output unit 82. For the specific functions of each unit, refer to the relevant descriptions in the embodiment corresponding to Figure 8, which are not repeated here.
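As a rough illustration of the unit division described above, the following Python sketch splits a program for the information output device into an information receiving unit and an information output unit. The class names, the `FirstFeedback` structure, the `run_once` helper, and the in-memory queue standing in for the link to the display device are all assumptions made for illustration; the application does not specify them.

```python
from dataclasses import dataclass
from queue import Empty, Queue
from typing import List, Optional


@dataclass
class FirstFeedback:
    """Hypothetical container for the first feedback information."""
    voice: Optional[str] = None  # text to be spoken, if any
    image: Optional[str] = None  # image reference, if any


class InformationReceivingUnit:
    """Stands in for unit 81: receives feedback sent by the display device."""

    def __init__(self, channel: Queue):
        # Assumed transport; a real device would use a network link.
        self._channel = channel

    def receive(self) -> Optional[FirstFeedback]:
        try:
            return self._channel.get_nowait()
        except Empty:
            return None


class InformationOutputUnit:
    """Stands in for unit 82: outputs feedback to the associated user."""

    def __init__(self) -> None:
        # Records what was output; used here only for demonstration.
        self.log: List[str] = []

    def output(self, feedback: FirstFeedback) -> None:
        if feedback.voice is not None:
            self.log.append(f"speak: {feedback.voice}")
        if feedback.image is not None:
            self.log.append(f"show: {feedback.image}")


def run_once(receiver: InformationReceivingUnit,
             output_unit: InformationOutputUnit) -> bool:
    """Receive one piece of feedback, if available, and output it."""
    feedback = receiver.receive()
    if feedback is None:
        return False
    output_unit.output(feedback)
    return True
```

Keeping reception and output in separate classes mirrors the unit division in the text; the two could equally be merged into one processing unit, as the description allows.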
[0122] Those skilled in the art will understand that Figure 10 is merely an example of the information output device 10 and does not constitute a limitation on the information output device 10. The device may include more or fewer components than shown, combine certain components, or use different components.
[0123] The processor 100 can be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor.
[0124] The memory 101 can be an internal storage unit of the information output device 10, such as a hard disk or RAM of the information output device 10. The memory 101 can also be an external storage device of the information output device 10, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the information output device 10. Furthermore, the memory 101 can include both internal and external storage units of the information output device 10. The memory 101 is used to store computer programs and other programs and data required by the information output device. The memory 101 can also be used to temporarily store data that has been output or will be output.
[0125] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units is merely an example. In practical applications, the above functions can be assigned to different functional units as needed; that is, the internal structure of the display device and the information output device can be divided into different functional units to complete all or part of the functions described above. The functional units in the embodiments can be integrated into one processing unit, each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the scope of protection of this application. For the specific working process of the units in the above system, refer to the corresponding process in the foregoing method embodiments, which is not repeated here.
[0126] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, can implement the steps in the various method embodiments described above.
[0127] This application provides a computer program product that, when run on a terminal device, enables the terminal device to implement the steps described in the various method embodiments above.
[0128] In the above embodiments, the description of each embodiment has its own focus. For parts that are not described or recorded in detail in a given embodiment, refer to the relevant descriptions of the other embodiments.
[0129] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0130] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A voice interaction method, characterized in that the method is applied to a display device and includes: obtaining a target voice command output by a target user who is currently using the display device, the target voice command being output by the target user based on the current display content of the display device; obtaining the identity information of the target user based on the voice features of the target voice command, and identifying, based on the identity information, the target information output device associated with the target user; generating first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the current display content, and the device information of the target information output device; and, if the target information output device is online, sending the first feedback information to the target information output device to instruct the target information output device to output the first feedback information to the target user.
2. The method according to claim 1, characterized in that the step of obtaining the target voice command output by the target user who is currently using the display device includes: acquiring user voice information output by any user currently using the display device; and, for each segment of user voice information, if the type of the segment is a question type or an instruction type and the relevance between the segment and the current display content is greater than a preset relevance threshold, identifying the segment as the target voice command output by the target user.
3. The method according to claim 2, characterized in that the step of acquiring user voice information output by any user currently using the display device includes: for any voice information acquired by the display device, matching the voice features of that voice information against pre-stored voice features and, if the match is successful, identifying that voice information as user voice information; wherein the pre-stored voice features include the voice features corresponding to each user associated with the display device.
4. The method according to claim 1, characterized in that the step of obtaining the target voice command output by the target user who is currently using the display device further includes: receiving any voice information sent by any information output device associated with the display device; and, if the type of that voice information is a question type or an instruction type and the relevance between that voice information and the current display content is greater than a preset relevance threshold, identifying that voice information as the target voice command output by the target user.
5. The method according to any one of claims 1 to 4, characterized in that the step of obtaining the identity information of the target user based on the voice features of the target voice command includes: obtaining the mapping relationship between the voiceprint information of each user and the identity information of each user; extracting the voiceprint information of the target voice command based on the voice features of the target voice command; and matching, based on the voiceprint information and the mapping relationship, the identity information of the target user corresponding to the target voice command from the candidate identity information.
6. The method according to any one of claims 1 to 4, characterized in that the step of generating first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the current display content, and the device information of the target information output device includes: identifying the device type of the target information output device based on the device information of the target information output device, the device type being divided into at least a first type, a second type, and a third type, wherein a target information output device of the first type includes only a voice output module, a target information output device of the second type includes only an image output module, and a target information output device of the third type includes both an image output module and a voice output module; if the device type is the first type, generating, based on the semantic information and the current display content, first feedback information that includes only voice information; if the device type is the second type, generating, based on the semantic information and the current display content, first feedback information that includes only image information; and, if the device type is the third type, generating, based on the semantic information and the current display content, first feedback information that includes both voice information and image information.
7. The method according to any one of claims 1 to 4, characterized in that, after generating the first feedback information corresponding to the target voice command based on the semantic information of the target voice command, the current display content, and the device information of the target information output device, the method further includes: if the target information output device is offline, generating, based on the semantic information and the current display content, second feedback information that includes both voice information and image information, and outputting the second feedback information to the target user through the display device.
8. A voice interaction method, characterized in that the method is applied to an information output device and includes: receiving first feedback information sent by a display device; and outputting the first feedback information to the target user associated with the information output device; wherein the first feedback information is generated by the display device based on the semantic information of a target voice command, the current display content of the display device, and the device information of the information output device, and is sent to the information output device when the information output device is online; and the target voice command is output by the target user based on the current display content of the display device.
9. The method according to claim 8, characterized in that the method further includes: after acquiring any voice information output by the target user associated with the information output device, sending that voice information to the display device to instruct the display device to determine whether to identify that voice information as the target voice command.
10. A voice interaction system, characterized in that the system includes a display device and an information output device, wherein the display device is used to perform the steps of the voice interaction method according to any one of claims 1 to 7, and the information output device is used to perform the steps of the voice interaction method according to any one of claims 8 to 9.
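The display-device side of the flow recited in claims 1, 5, 6 and 7 can be pictured with the minimal Python sketch below. It is only an illustration under stated assumptions: the names `DeviceType`, `OutputDevice`, `Feedback`, `identify_user`, `build_feedback`, and `route_feedback` are hypothetical, the voiceprint lookup is an exact dictionary match (a real system would presumably use a similarity measure, which the claims do not prescribe), and the generated feedback is represented by placeholder strings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class DeviceType(Enum):
    VOICE_ONLY = 1       # first type: voice output module only
    IMAGE_ONLY = 2       # second type: image output module only
    VOICE_AND_IMAGE = 3  # third type: both modules


@dataclass
class OutputDevice:
    device_type: DeviceType
    online: bool


@dataclass
class Feedback:
    voice: Optional[str] = None
    image: Optional[str] = None


def identify_user(voiceprint: str, mapping: Dict[str, str]) -> Optional[str]:
    """Look up identity from a voiceprint (exact match is an assumption)."""
    return mapping.get(voiceprint)


def build_feedback(semantics: str, displayed: str,
                   device_type: DeviceType) -> Feedback:
    """Choose the feedback modalities from the device type."""
    answer = f"{semantics} (about: {displayed})"  # placeholder content
    wants_voice = device_type in (DeviceType.VOICE_ONLY,
                                  DeviceType.VOICE_AND_IMAGE)
    wants_image = device_type in (DeviceType.IMAGE_ONLY,
                                  DeviceType.VOICE_AND_IMAGE)
    return Feedback(
        voice=answer if wants_voice else None,
        image=f"image for: {answer}" if wants_image else None,
    )


def route_feedback(semantics: str, displayed: str,
                   device: OutputDevice) -> Tuple[str, Feedback]:
    """Send first feedback if the device is online, else fall back."""
    first = build_feedback(semantics, displayed, device.device_type)
    if device.online:
        return ("output_device", first)
    # Offline: second feedback with both voice and image, shown on the display.
    second = build_feedback(semantics, displayed, DeviceType.VOICE_AND_IMAGE)
    return ("display_device", second)
```

For example, calling `route_feedback` with an online `IMAGE_ONLY` device returns the destination `"output_device"` together with image-only feedback, while an offline device of any type yields the destination `"display_device"` and feedback carrying both voice and image.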