Implementation of on-device voice assistant
By introducing a streamlined library and a cloud-connected voice assistant system on the device side, the problem of consistent experience across devices is solved, enabling devices to update and integrate voice assistant functions without additional work.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GOOGLE LLC
- Filing Date
- 2017-05-10
- Publication Date
- 2026-06-19
AI Technical Summary
Existing voice assistant systems struggle to provide a consistent user experience across multiple devices, and device manufacturers need to do additional work to receive and integrate voice assistant functionality.
It employs a streamlined, low-resource-consumption device-side library, including local processing of audio data and wake word detection, supports connectivity with the cloud brain, and provides a scalable voice action control system to enable cross-device voice assistant functionality.
It achieves a consistent user experience across multiple devices, and device manufacturers can obtain updates and integrations of the voice assistant functionality without additional work.
Figure CN122240055A_ABST
Abstract
Description
[0001] Divisional Application Statement
[0002] This application is a divisional application of Chinese Invention Patent Application No. 201780009238.2, filed on May 10, 2017.
Technical Field
[0003] This application generally relates to computer technology, including but not limited to voice assistants and related libraries for devices.
Background Technology
[0004] Voice-based assistants, which interact with users through audio / voice input and output, are becoming increasingly popular with the development of the internet and cloud computing. These assistants can provide interfaces for digital media consumption and offer various types of information, including news, sports scores, weather, and stocks, among others.
[0005] Users can have multiple devices that offer voice-based assistant functionality. The expectation is for a voice-based assistant that can be implemented and used across various devices, provides a consistent experience across devices, and supports device-specific features.
Summary of the Invention
[0006] The embodiments described in this specification are intended to embed or include voice assistants in embedded systems and / or devices in a manner that enables control of local devices for various operating system platforms.
[0007] According to some implementations, the features of the streamlined, low-resource-consumption device-side library include local processing of audio data, listening for wake words or hot words, and sending user requests. Other features include connectivity with a cloud brain, a scalable voice action control system, a portability layer that allows integration into various operating environments, and the ability to be updated asynchronously from the rest of the client software.
[0008] The described implementation has the advantage of providing a similar user experience for interacting with a voice assistant on many different devices.
[0009] The described implementation has another advantage: it decouples innovation in voice assistant capabilities from innovation enabled by the device itself. For example, if an improved recognition pipeline is created, it can be pushed to the device; the device manufacturer does not need to do anything to receive it and can still benefit from previous voice commands.
[0010] According to some embodiments, a method at an electronic device having an audio input system, one or more processors, and a memory storing one or more programs executed by the one or more processors includes: receiving verbal input at the device; processing the verbal input; transmitting a request to a remote system, the request including information determined based on the verbal input; receiving a response to the request, wherein the response is generated by the remote system based on the information based on the verbal input; and performing an operation based on the response, wherein one or more of the receiving, processing, transmitting, receiving, and performing are performed by one or more voice processing modules of a voice assistant library executed on the electronic device, the voice processing modules providing multiple voice processing operations accessible to one or more applications and / or operating software executed or executable on the electronic device.
[0011] In some implementations, a device-agnostic voice assistant library for an electronic device including an audio input system includes: one or more voice processing modules configured to execute on a common operating system implemented on multiple different types of electronic devices, the voice processing modules providing multiple voice processing operations accessible to applications and operating software executed on the electronic device, thereby enabling the portability of voice-enabled applications configured to interact with one or more voice processing operations.
[0012] In some embodiments, the electronic device includes an audio input system, one or more processors, and memory storing one or more programs to be executed by the one or more processors. The one or more programs include instructions for performing the following steps: receiving verbal input at a device; processing the verbal input; transmitting a request to a remote system, the request including information determined based on the verbal input; receiving a response to the request, wherein the response is generated by the remote system based on the information based on the verbal input; and performing an operation based on the response, wherein one or more of the receiving, processing, transmitting, receiving, and performing operations are performed by one or more voice processing modules of a voice assistant library running on the electronic device, the voice processing modules providing access to multiple voice processing operations accessible to one or more applications and / or operating software running or executable on the electronic device.
[0013] In some embodiments, a non-transitory computer-readable storage medium stores one or more programs. The one or more programs include instructions that, when executed by an electronic device having an audio input system and one or more processors, cause the electronic device to perform the following operations: receiving verbal input at the device; processing the verbal input; transmitting a request to a remote system, the request including information determined based on the verbal input; receiving a response to the request, wherein the response is generated by the remote system based on the information determined from the verbal input; and performing an operation according to the response, wherein one or more of the receiving, processing, transmitting, receiving, and performing operations are performed by one or more voice processing modules of a voice assistant library executing on the electronic device, the voice processing modules providing multiple voice processing operations accessible to one or more applications and / or operating software executing or executable on the electronic device.
Attached Figure Description
[0014] Figure 1 is a block diagram illustrating an example network environment according to some implementations.
[0015] Figure 2 is a diagram illustrating an example voice assistant client device according to some implementations.
[0016] Figure 3 is a diagram illustrating an example server system according to some implementations.
[0017] Figure 4 is a block diagram showing a functional view of a voice assistant library according to some implementations.
[0018] Figure 5 is a flowchart of a method for processing verbal input on a device, according to some implementations.
[0019] Throughout the accompanying drawings, like reference numerals refer to corresponding parts.
Detailed Implementation
[0020] Various embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Numerous specific details are set forth in the following detailed description to provide a thorough understanding of the invention and the described embodiments. However, the invention may be practiced without these specific details. In other instances, well-known methods, processes, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
[0021] In some implementations, the purpose of a voice assistant is to provide users with a personalized voice interface available across various devices and to enable a wide range of use cases, delivering a consistent experience throughout the user's day. Voice assistants and / or related functionalities can be integrated into first-party and third-party products and devices.
[0022] An example use case involves media. Voice commands can be used to initiate and control the playback of music, radio, podcasts, news, and other audio media. For example, a user can issue voice commands such as "play jazz music," "play 107.5 FM," "skip to next song," or "play 'Serial'" to play or control various types of audio media. Furthermore, such commands can be used to play audio media from a variety of sources, such as online streams of terrestrial radio stations, music subscription services, local storage, remote storage, etc. Additionally, voice assistants can leverage the integrations available on casting devices to support additional content.
[0023] Another example use case involves remote playback. A user can issue voice commands to a projection device that includes voice assistant functionality, and based on the voice command, media is played back (e.g., projected) on a device specified in the command, on a device in a specified group of one or more devices, or on one or more devices in a region specified in the command. The user can also specify a general category or specific content in the command, and the appropriate media is played based on the category or content specified.
[0024] Another example use case is non-media, such as productivity features (e.g., timers, alarm clocks, calendars), home automation, search engine-powered questions and answers (e.g., search queries), fun (e.g., assistant characters, jokes, games, Easter eggs), and daily tasks (e.g., transportation, navigation, food, finance, gifts, etc.).
[0025] In some implementations, a voice assistant is provided as an optional feature of the projection device, and the voice assistant functionality can be updated as part of the projection device.
[0026] In some implementations, an application processor (e.g., executing at a client device or projection device where the user speaks a voice command or utters verbal input) performs the detection of hot words or keywords from the user's voice commands and verbal input. In other implementations, hot word detection is performed by an external digital signal processor (e.g., by a server system that processes the voice commands, in contrast to a client or projection device where the user speaks a voice command or utters verbal input).
[0027] In some implementations, a device with voice assistant features includes one or more of the following: far-field support, "push to assist" or "push to talk" (e.g., a button to initiate voice assistant functionality), and AC power.
[0028] In some implementations, the voice assistant includes an application programming interface (API) for one or more of the following: audio input (e.g., a microphone for playback of media), microphone status (e.g., on / off), ducking (e.g., reducing the volume of all outputs when the assistant is triggered by a hot word or push to talk), and new assistant events and status messages (e.g., the assistant is triggered (e.g., by hearing a hot word or by a press of the assistant button), listening to speech, waiting on a server, responding, response complete, an alarm / timer is playing).
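As an illustration only, the event and status messages listed above, together with the volume reduction on trigger, could be modeled roughly as follows. This is a minimal sketch, not the API of the described library: the names `AssistantEvent`, `AssistantEvents`, and `OutputVolume`, and the volume levels, are hypothetical.

```python
from enum import Enum, auto
from typing import Callable, List


class AssistantEvent(Enum):
    """Hypothetical event names mirroring the statuses listed above."""
    TRIGGERED = auto()           # hot word heard or assistant button pressed
    LISTENING = auto()           # listening to speech
    WAITING_ON_SERVER = auto()
    RESPONDING = auto()
    RESPONSE_COMPLETE = auto()
    ALARM_PLAYING = auto()       # an alarm / timer is playing


class AssistantEvents:
    """Minimal publish/subscribe hub for assistant events."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[AssistantEvent], None]] = []

    def subscribe(self, listener: Callable[[AssistantEvent], None]) -> None:
        self._listeners.append(listener)

    def emit(self, event: AssistantEvent) -> None:
        for listener in self._listeners:
            listener(event)


class OutputVolume:
    """Lowers output volume while the assistant is active (assumed policy)."""

    def __init__(self, level: float = 1.0, ducked_level: float = 0.2) -> None:
        self.normal_level = level
        self.ducked_level = ducked_level
        self.level = level

    def on_event(self, event: AssistantEvent) -> None:
        if event is AssistantEvent.TRIGGERED:
            self.level = self.ducked_level      # reduce all outputs
        elif event is AssistantEvent.RESPONSE_COMPLETE:
            self.level = self.normal_level      # restore afterwards


events = AssistantEvents()
volume = OutputVolume()
events.subscribe(volume.on_event)

events.emit(AssistantEvent.TRIGGERED)
ducked = volume.level
events.emit(AssistantEvent.RESPONSE_COMPLETE)
restored = volume.level
```

Routing status changes through a single event hub is one way for device-specific code (LEDs, volume control) to react to assistant state without the library knowing about the hardware.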
[0029] In some implementations, a device with voice assistant functionality may communicate with another device for configuration purposes (e.g., using a configuration app on a smartphone) to enable or facilitate the functionality of the voice assistant on the device (e.g., setting up the voice assistant functionality on the device, providing tutorials to the user). Configuration or setup may include specifying the device location, associating the device with a user account, letting the user opt in to voice control, linking to and prioritizing media services (e.g., video streaming services, music streaming services), and configuring home automation, etc.
[0030] In some implementations, a device with a voice assistant may include one or more user interface elements or indications to the user. The one or more user interface elements are physical (e.g., light patterns displayed using one or more LEDs, sound patterns output by a speaker) and may include one or more of the following: a "push to assist" or "push to talk" trigger independent of hot words; a "mute microphone" trigger and visual status indication; an "awaiting hot word" visual status indication; a "hot word detected" visual indication; an "assistant is actively listening" visual indication visible at a distance (e.g., 15 feet); an "assistant is working / thinking" visual indication; a "voice message / notification is available" visual indication; a "volume level" control method and status indicator; and a "pause / resume" control method. In some implementations, these physical user interface elements are provided by a client device or a projection device. In some implementations, the voice assistant supports a set of user interface elements or indications that are common across different devices to ensure a consistent experience across different devices.
[0031] In some implementations, the voice assistant supports device-specific commands and / or hot words as well as a standardized, predefined set of commands and / or hot words.
[0032] Figure 1 illustrates a network environment 100 according to some embodiments. The network environment 100 includes a projection device 106 and / or a voice assistant client device 104. The projection device 106 (e.g., CHROMECAST from Google Inc.) is directly or otherwise communicatively coupled to an audio input device 108 (e.g., a microphone) and an audio output device 110 (e.g., one or more speakers). In some embodiments, both the audio input device 108 and the audio output device 110 are components of devices (e.g., speaker systems, televisions, sound bars) communicatively coupled to the projection device 106. In some embodiments, the audio input device 108 is a component of the projection device 106, and the audio output device 110 is a component of a device communicatively coupled to the projection device 106, or vice versa. In some embodiments, both the audio input device 108 and the audio output device 110 are components of the projection device 106.
[0033] In some implementations, the projection device 106 is communicatively coupled to the client 102. The client 102 may include an application or module that facilitates the configuration of the projection device 106 (e.g., a projection device settings application), including voice assistant features.
[0034] In some implementations, the projection device 106 is coupled to the display 144.
[0035] In some implementations, the projection device 106 includes one or more visual indicators 142 (e.g., LED lights).
[0036] In some embodiments, the projection device 106 includes a receiver module 146. In some embodiments, the receiver module 146 operates the projection functionality of the projection device 106, including, for example, hardware functions and communication with content sources. In some embodiments, different receiver modules 146 are present at the projection device 106 for different content sources. In some embodiments, the receiver module 146 includes corresponding submodules for different content sources.
[0037] Voice assistant client device 104 (e.g., a smartphone, laptop or desktop computer, tablet, voice command device, mobile device or in-vehicle system with Google Assistant from Google Inc., Google Home from Google Inc.) includes an audio input device 132 (e.g., a microphone) and an audio output device 134 (e.g., one or more speakers, headphones). In some embodiments, voice assistant client device 104 (e.g., a voice command device, a mobile device or in-vehicle system with Google Assistant from Google Inc., Google Home from Google Inc.) is communicatively coupled to client 140 (e.g., a smartphone, tablet device). Client 140 may include applications or modules that facilitate the configuration of voice assistant client device 104 (e.g., a voice command device settings application), including voice assistant features.
[0038] In some implementations, the voice assistant client device 104 includes one or more visual indicators 152 (e.g., LED lights). An example of a voice assistant client device with visual indicators (e.g., LED lights) is shown in Figure 4A of U.S. Provisional Application No. 62/336,566, filed May 13, 2016, entitled “LED Design Language for Visual Affordance of Voice User Interfaces,” which is incorporated herein by reference in its entirety.
[0039] Projection device 106 and voice assistant client device 104 include corresponding instances of voice assistant module or library 136. Voice assistant module / library 136 is a module / library that implements voice assistant functionality across various devices (e.g., projection device 106, voice assistant client device 104). Voice assistant functionality remains consistent across devices while still implementing device-specific features (e.g., supporting control of device-specific features via voice assistant). In some implementations, voice assistant module or library 136 is the same or similar across devices; instances of the same library may be included in various devices.
[0040] In some implementations, depending on the type of device, the voice assistant module / library 136 is included in an application installed in the device, in the device's operating system, or embedded in the device (e.g., embedded in firmware).
[0041] In some implementations, the voice assistant module / library 136-1 at the projection device 106 communicates with the receiver module 146 to perform voice assistant operations.
[0042] In some implementations, the voice assistant module / library 136-1 at projection device 106 can control or otherwise influence the visual indicator 142.
[0043] In some implementations, the voice assistant module / library 136-2 at the voice assistant client device 104 may control or otherwise influence the visual indicator 152.
[0044] Projection device 106 and voice assistant client device 104 are communicatively coupled to server system 114 via one or more communication networks 112 (e.g., LAN, WAN, Internet). Voice assistant module / library 136 detects (e.g., receives) spoken input picked up (e.g., captured) by audio input device 108 / 132, processes the spoken input (e.g., detects hot words), and transmits the processed spoken input or its encoded form to server 114. Server 114 receives the processed spoken input or its encoding and processes it to determine an appropriate response. An appropriate response may be content, information, or instructions, commands, or metadata for projection device 106 or voice assistant client device 104 to perform a function or operation. Server 114 sends this response to projection device 106 or voice assistant client device 104, where the content or information is output (e.g., via audio output device 110 / 134) and / or a function is performed. As part of the process, server 114 may communicate with one or more content or information sources 138 to obtain content or information for a response, or a reference thereto. In some embodiments, content or information sources 138 include, for example, search engines, databases, information associated with a user's account (e.g., calendar, task list, email), websites, and media streaming services. In some embodiments, voice assistant client device 104 and projection device 106 may communicate or interact with each other. Examples of such communication or interaction, as well as example operations of a voice assistant client device 104 (e.g., Google Home from Google Inc.), are described in U.S. Provisional Application No. 62/336,566, filed May 13, 2016, entitled “LED Design Language for Visual Affordance of Voice User Interfaces”; U.S. Provisional Application No. 62/336,569, filed May 13, 2016, entitled “Voice-Controlled Closed Caption Display”; and U.S. Provisional Application No. 62/336,565, filed May 13, 2016, entitled “Media Transfer among Media Output Devices”, all of which are incorporated herein by reference in their entirety.
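The round trip described in this paragraph (local hot word detection, request to the server, response, then output and / or a device function) can be sketched as below. This is only a rough illustration under loud assumptions: a text transcript stands in for captured audio, `fake_server` stands in for server 114, and the hot word, the `Response` fields, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

HOTWORD = "ok assistant"  # hypothetical hot word


@dataclass
class Response:
    speech: Optional[str] = None    # content to output via the audio output device
    command: Optional[str] = None   # function for the device to perform


def detect_hotword(transcript: str) -> Optional[str]:
    """Return the request that follows the hot word, or None if it is absent."""
    text = transcript.strip().lower()
    if text.startswith(HOTWORD):
        return text[len(HOTWORD):].strip(" ,")
    return None


def fake_server(request: str) -> Response:
    """Stand-in for the remote system: maps a request to a response."""
    if request == "what time is it":
        return Response(speech="It is noon.")
    if request == "volume up":
        return Response(command="volume_up")
    return Response(speech="Sorry, I did not understand.")


def handle_verbal_input(
    transcript: str,
    send: Callable[[str], Response],
    speak: Callable[[str], None],
    perform: Callable[[str], None],
) -> bool:
    """Process input locally, query the server, then act on the response."""
    request = detect_hotword(transcript)
    if request is None:
        return False                 # no hot word: nothing is transmitted
    response = send(request)
    if response.speech is not None:
        speak(response.speech)
    if response.command is not None:
        perform(response.command)
    return True


spoken, performed = [], []
handled = handle_verbal_input(
    "OK Assistant, volume up", fake_server, spoken.append, performed.append
)
ignored = handle_verbal_input(
    "what time is it", fake_server, spoken.append, performed.append
)
```

Passing `send`, `speak`, and `perform` in as callables keeps the flow itself independent of the network transport and of the device's audio and control hardware.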
[0045] In some implementations, the voice assistant module / library 136 receives spoken input captured by the audio input device 108 / 132 and transmits the spoken input (with little or no processing) or its encoding to the server 114. The server 114 processes the spoken input to detect hot words and determine an appropriate response, and sends the response to the projection device 106 or the voice assistant client device 104.
[0046] If server 114 determines that the verbal input includes a command for projection device 106 or voice assistant client device 104 to perform a function, server 114 transmits, in the response, instructions or metadata that instruct projection device 106 or voice assistant client device 104 to perform that function. The function may be device-specific, and support for such functions in the voice assistant may be included in projection device 106 or voice assistant client device 104 as a custom module or function added to or linked to voice assistant module / library 136.
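One way to picture a custom, device-specific function linked to the library is a handler registry that dispatches instructions arriving in a response. The sketch below is hypothetical throughout: the class name, the `register_function` / `perform` methods, and the instruction format (a dict with `function` and `args`) are assumptions for illustration, not the format used by the described system.

```python
from typing import Any, Callable, Dict


class VoiceAssistantLibrary:
    """Toy stand-in for the library's extension point (hypothetical API)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register_function(self, name: str, handler: Callable[..., Any]) -> None:
        """Link a device-specific function to the library under a name."""
        self._handlers[name] = handler

    def perform(self, instruction: Dict[str, Any]) -> Any:
        """Dispatch an instruction received in a server response."""
        name = instruction["function"]
        handler = self._handlers.get(name)
        if handler is None:
            raise KeyError(f"unsupported function: {name}")
        return handler(**instruction.get("args", {}))


# A device manufacturer adds a function only this device supports.
library = VoiceAssistantLibrary()
library.register_function("set_brightness", lambda level: f"brightness={level}")

result = library.perform({"function": "set_brightness", "args": {"level": 40}})
```

With this shape, the shared library stays identical across devices, and only the registered handlers differ from one device to another.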
[0047] In some implementations, server 114 includes or is coupled to a speech processing backend 148, which performs spoken input processing operations and determines a response to spoken input.
[0048] In some implementations, server 114 includes a downloadable voice assistant library 150. The downloadable voice assistant library 150 (e.g., the same as or an update of voice assistant library 136) may include new features and functionalities or updates, and may be downloaded to add the voice assistant library to a device or update voice assistant library 136.
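A device could decide whether to download the library with a check along these lines. This is a sketch under stated assumptions: the source does not say how versions are identified, so the dotted numeric version strings and the function names here are hypothetical.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn an assumed dotted version string such as '1.10.0' into (1, 10, 0)."""
    return tuple(int(part) for part in version.split("."))


def needs_download(installed: Optional[str], available: str) -> bool:
    """True if the library is missing on the device or the server copy is newer."""
    if installed is None:
        return True   # library not yet on the device: add it
    return parse_version(available) > parse_version(installed)


add_library = needs_download(None, "1.0.0")
update_library = needs_download("1.9.0", "1.10.0")
up_to_date = needs_download("2.0.0", "2.0.0")
```

Comparing parsed integer tuples rather than raw strings avoids treating "1.9.0" as newer than "1.10.0".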
[0049] Figure 2 is a block diagram illustrating an example voice assistant client device 104 or projection device 106 of a network environment 100 according to some embodiments. Examples of voice assistant client devices 104 include, but are not limited to, mobile phones, tablet computers, laptop computers, desktop computers, wireless speakers (e.g., Google Home from Google Inc.), voice command devices (e.g., Google Home from Google Inc.), televisions, soundbars, projection devices (e.g., CHROMECAST from Google Inc.), media streaming devices, home appliances, consumer electronics, in-vehicle systems, and wearable personal devices. Voice assistant client device 104 (e.g., Google Home from Google Inc., a mobile device with Google Assistant capability) or projection device 106 (e.g., CHROMECAST from Google Inc.) typically includes one or more processing units (CPUs) 202, a network interface 204, a memory 206, and one or more communication buses 208 (sometimes referred to as chipsets) for interconnecting these components. The voice assistant client device 104 or projection device 106 includes one or more input devices 210 to facilitate user input, including audio input devices 108 or 132 (e.g., a voice command input unit or microphone) and, optionally, other input devices such as a keyboard, mouse, touchscreen display, touch-sensitive input pad, gesture capture camera, or other input buttons or controls. In some embodiments, the voice assistant client device 104 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. The voice assistant client device 104 or projection device 106 also includes one or more output devices 212, including audio output devices 110 or 134 (e.g., one or more speakers, headphones, etc.), and, optionally, one or more visual displays (e.g., display 144) and / or one or more visual indicators 142 or 152 (e.g., LEDs) capable of presenting a user interface and displaying content and information. Optionally, the voice assistant client device 104 or projection device 106 includes a location detection unit 214, such as a GPS (Global Positioning System) or other geolocation receiver, for determining the location of the voice assistant client device 104 or projection device 106. The voice assistant client device 104 or projection device 106 may optionally include a proximity detection device 215, such as an IR sensor, for determining the proximity of the voice assistant client device 104 or projection device 106 to other objects (e.g., a user, or, in the case of a wearable personal device, the wearer). Optionally, the voice assistant client device 104 or projection device 106 includes a sensor 213 (e.g., an accelerometer, a gyroscope, etc.).
[0050] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid-state memory devices; and optionally includes non-volatile memory, such as one or more disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid-state memory devices. Memory 206 optionally includes one or more storage devices located remotely from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer-readable storage medium. In some embodiments, memory 206 or the non-transitory computer-readable storage medium of memory 206 stores the following programs, modules, and data structures, or subsets or supersets thereof:
• Operating system 216, which includes procedures for handling various basic system services and for performing hardware-related tasks;
• Network communication module 218, used to connect the voice assistant client device 104 or projection device 106 to other devices (e.g., server system 114, clients 102, 140, other voice assistant client devices 104 or projection devices 106) via one or more network interfaces 204 (wired or wireless) and one or more networks 112 (e.g., Internet, other wide area networks, local area networks, metropolitan area networks, etc.);
• User interface module 220, which enables information to be presented on voice assistant client device 104 or projection device 106 via one or more output devices 212 (e.g., display, speaker, etc.);
• Input processing module 222, used to process one or more user inputs or interactions captured or received by one or more input devices 210, and to interpret the inputs or interactions;
• Voice assistant module 136, used for processing verbal input, providing verbal input to server 114, receiving responses from server 114, and outputting those responses;
• Client data 226, used to store at least the data associated with the voice assistant module 136, including:
  o Voice assistant settings 228, used to store information associated with the settings and configurations of the voice assistant module 136 and voice assistant functions;
  o Content / information source 230 and category 232, used to store predefined and / or user-specified sources and categories of content or information;
  o Use history 234, used to store information (e.g., logs) associated with the operation and use of the voice assistant module 136, such as received commands and requests, responses to commands and requests, operations performed in response to commands and requests, etc.; and
  o User accounts and authorizations 236, used to store authorization and authentication information for one or more users to access the corresponding user accounts at content / information source 230, as well as account information for these authorized accounts; and
• Receiver module 146, used to operate the projection function of projection device 106, including communicating with a content source to receive content for playback.
[0051] In some implementations, the voice assistant client device 104 or projection device 106 includes one or more libraries and one or more application programming interfaces (APIs) for voice assistant and related functions. These libraries may be included in or linked to the voice assistant module 136 or receiver module 146. The libraries include modules associated with voice assistant functions or other functions that facilitate voice assistant functions. The APIs provide interfaces to hardware and other software (e.g., operating system, other applications) to facilitate voice assistant functions. For example, the voice assistant client library 240, debug library 242, platform API 244, and POSIX API 246 may be stored in memory 206. These libraries and APIs are further described below with reference to Figure 4.
[0052] In some embodiments, the voice assistant client device 104 or projection device 106 includes a voice application 250 that uses the modules and functions of the voice assistant client library 240 and, optionally, the debug library 242, platform API 244, and POSIX API 246. In some embodiments, the voice application 250 is a voice-enabled first-party or third-party application that uses the voice assistant client library 240.
[0053] Each of the aforementioned identified elements can be stored in one or more of the aforementioned memory devices and corresponds to an instruction set for performing the aforementioned functions. The aforementioned identified modules or programs (i.e., instruction sets) do not need to be implemented as separate software programs, processes, modules, or data structures, and therefore various subsets of these modules can be combined or otherwise rearranged in various embodiments. In some embodiments, memory 206 may optionally store a subset of the aforementioned identified modules and data structures. Furthermore, memory 206 may optionally store additional modules and data structures not described above.
[0054] Figure 3 is a block diagram illustrating an example server system 114 of a network environment 100 according to some embodiments. Server 114 typically includes one or more processing units (CPUs) 302, one or more network interfaces 304, memory 306, and one or more communication buses 308 (sometimes referred to as chipsets) for interconnecting these components. Server 114 optionally includes one or more input devices 310 facilitating user input, such as a keyboard, mouse, voice command input unit or microphone, touchscreen display, touch-sensitive input pad, gesture capture camera, or other input buttons or controls. Furthermore, server 114 may use a microphone and voice recognition or a camera and gesture recognition to supplement or replace a keyboard. In some embodiments, server 114 optionally includes one or more cameras, scanners, or light sensor units for capturing images, such as graphic serial codes printed on electronic devices. Server 114 optionally also includes one or more output devices 312 enabling the presentation of a user interface and the display of content, including one or more speakers and / or one or more visual displays.
[0055] Memory 306 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid-state memory devices; and optionally includes non-volatile memory, such as one or more disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid-state memory devices. Memory 306 optionally includes one or more storage devices located remotely from one or more processing units 302. Memory 306, or alternatively the non-volatile memory within memory 306, includes a non-transitory computer-readable storage medium. In some embodiments, memory 306 or the non-transitory computer-readable storage medium of memory 306 stores the following programs, modules, and data structures, or subsets or supersets thereof:
• Operating system 316, which includes procedures for handling various basic system services and for performing hardware-related tasks;
• Network communication module 318, used to connect server system 114 to other devices (e.g., voice assistant client device 104, projection device 106, client 102, client 140, etc.) via one or more network interfaces 304 (wired or wireless) and one or more networks 112 (e.g., Internet, other wide area networks, local area networks, metropolitan area networks, etc.);
• Proximity / location determination module 320, used to determine the proximity and / or location of the voice assistant client device 104 or the projection device 106 based on the location information of the client device 104 or the projection device 106;
• Voice assistant backend 116, for processing voice assistant verbal input (e.g., verbal input received from voice assistant client device 104 and projection device 106), including one or more of the following:
  o Verbal input processing module 324, which processes verbal input to identify commands and requests in the verbal input;
  o Content / information collection module 326, which collects content and information responsive to commands and requests; and
  o Response generation module 328, which generates verbal output in response to commands and requests and fills the verbal output with responsive content and information; and
• Server system data 330, which stores at least the data related to the operation of the voice assistant platform, including:
  o User data 332, used to store information associated with users of the voice assistant platform, including:
    • User voice assistant settings 334, used to store voice assistant setting information corresponding to voice assistant settings 228, as well as information corresponding to content / information source 230 and category 232;
    • User history 336, used to store the user's history with the voice assistant (e.g., logs), including the history of commands and requests and corresponding responses; and
    • User account and authorization 338, used to store the user's authorization and authentication information to access the corresponding user account at content / information source 230, as well as the account information of those authorized accounts, corresponding to user account and authorization 236.
[0056] Each of the elements identified above can be stored in one or more of the memory devices mentioned above and corresponds to a set of instructions for performing the functions described above. The modules or programs identified above (i.e., sets of instructions) do not need to be implemented as separate software programs, processes, modules, or data structures, and therefore various subsets of these modules can be combined or otherwise rearranged in various embodiments. In some embodiments, memory 306 may optionally store a subset of the modules and data structures identified above. Furthermore, memory 306 may optionally store additional modules and data structures not described above.
[0057] In some implementations, the voice assistant module 136 (Figure 2) includes one or more libraries. A library includes modules or submodules that perform corresponding functions. For example, a voice assistant client library contains modules that perform voice assistant functions. Voice assistant module 136 may also include one or more application programming interfaces (APIs) for cooperating with specific hardware (such as hardware on a client or projection device), specific operating software, or a remote system.
[0058] In some implementations, the library includes modules supporting audio signal processing operations, including, for example, bandpass, filtering, erasure, and hotword detection. In some implementations, the library includes modules for connecting to a backend (e.g., server-based) voice processing system. In some implementations, the library includes modules for debugging (e.g., debugging voice recognition, debugging hardware problems, automated testing).
[0059] Figure 4 illustrates libraries and APIs that can be stored in the voice assistant client device 104 or projection device 106 and run by the voice assistant module 136 or another application. The libraries and APIs may include a voice assistant client library 240, a debug library 242, a platform API 244, and a POSIX API 246. An application at the voice assistant client device 104 or projection device 106 (e.g., the voice assistant module 136, or another application that may wish to support cooperation with the voice assistant) may include or link to the libraries and APIs and run them to provide or support voice assistant functionality within the application. In some implementations, the voice assistant client library 240 and the debug library 242 are separate libraries; keeping them separate facilitates different release and update processes that account for the different security implications of these libraries.
[0060] In some implementations, the library is flexible; it can be used on various device types and contains the same voice assistant functionality.
[0061] In some implementations, the library relies on standard shared objects (e.g., standard Linux shared objects) and is therefore compatible with different operating systems or platforms that use these standard shared objects (e.g., various Linux distributions and flavors of embedded Linux).
[0062] In some implementations, POSIX API 246 provides a standard API for compatibility with various operating systems. Therefore, the voice assistant client library 240 can be included in devices with different POSIX-compliant operating systems, and POSIX API 246 provides a compatibility interface between the voice assistant client library 240 and different operating systems.
[0063] In some implementations, the library includes modules that support and facilitate basic use cases available on different types of devices that implement voice assistants (e.g., timers, alarms, volume control).
[0064] In some implementations, the voice assistant client library 240 includes a controller interface 402, which includes functions or modules for launching, configuring, and interacting with the voice assistant. In some implementations, the controller interface 402 includes a "Start()" function or module 404 for launching the voice assistant at the device; a "RegisterAction()" function or module 406 for registering an action using the voice assistant (e.g., making the action actionable via the voice assistant); a "Reconfigure()" function 408 for reconfiguring the voice assistant with updated settings; and a "RegisterEventObserver()" function 410 for registering a set of functions for basic events using the assistant.
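The comments in the sample "Controller" class later in this description state a return-value contract for three of these functions: Start() succeeds only once, Reconfigure() fails if the assistant has not yet been started, and RegisterAction() fails for a module that is already registered. The following is a minimal sketch of that contract only, not the library's implementation. `SketchController` is a hypothetical stand-in; it omits the settings arguments, tracks "once" per object rather than per process, and assumes that modules are identified by name.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in that models only the documented return values of
// Start(), Reconfigure(), and RegisterAction(). Not the real Controller.
class SketchController {
 public:
  // Succeeds at most once; every later call returns false.
  bool Start() {
    if (started_) return false;
    started_ = true;
    return true;
  }

  // Fails unless the assistant has already been started.
  bool Reconfigure() const { return started_; }

  // Fails if a module with the same name is already registered.
  // Keying on the module name is an assumption of this sketch.
  bool RegisterAction(const std::string& module_name) {
    return registered_.insert(module_name).second;
  }

 private:
  bool started_ = false;
  std::set<std::string> registered_;
};
```

With this sketch, calling Reconfigure() before Start() returns false, a second Start() returns false, and registering the same module name twice returns false on the second attempt.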
[0065] In some implementations, the voice assistant client library 240 includes multiple functions or modules associated with specific voice assistant functions. For example, a hot word detection module 412 processes voice input to detect hot words. A voice processing module 414 processes the voice input and converts the voice into text or vice versa (e.g., word and phrase recognition, voice-to-text data conversion, text-to-voice data conversion). An action processing module 416 performs actions and operations in response to verbal input. A local timer / alarm / volume control module 418 facilitates alarm clock, timer, and volume control functions at the device and controls them via voice input (e.g., maintaining timers, clocks, alarm clocks at the device). A log / metric module 420 records (e.g., logs) voice input and responses, and determines and records relevant metrics (e.g., response time, idle time, etc.). An audio input processing module 422 processes the audio of the voice input. An MP3 decoding module 424 decodes MP3-encoded audio. An audio input module 426 captures audio via an audio input device (e.g., a microphone). The audio output module 428 outputs audio through an audio output device (e.g., a speaker). The event queuing and status tracking module 430 is used to queue events associated with the voice assistant at the device and track the status of the voice assistant at the device.
[0066] In some implementations, debug library 242 provides modules and functions for debugging. For example, HTTP server module 432 facilitates debugging connectivity issues, and debug server / audio stream module 434 is used for debugging audio issues.
[0067] In some implementations, platform API 244 provides an interface between the voice assistant client library 240 and the hardware functions of the device. For example, the platform API includes a button input interface 436 for capturing button input on the device, a loopback audio interface 438 for capturing loopback audio, a logging and metrics interface 440 for recording and determining metrics, an audio input interface 442 for capturing audio input, an audio output interface 444 for outputting audio, and an authentication interface 446 for authenticating the user using other services that can interact with the voice assistant. An advantage of the voice assistant client library organization illustrated in Figure 4 is that it can provide the same or similar voice processing capabilities across various voice assistant device types with a consistent API and set of voice assistant features. This consistency supports the portability of voice assistant applications and the consistency of voice assistant operation, which in turn facilitates consistent user interaction and familiarity with voice assistant applications and functions implemented on different device types. In some implementations, all or part of the voice assistant client library 240 may be provided at server 114 to support server-based voice assistant applications (e.g., server applications that operate on voice input transmitted to server 114 for processing).
[0068] Below is sample code for the classes and functions corresponding to controller 402 (“Controller”) and related classes. These classes and functions can be adopted by applications executable on various devices via a common API.
[0069] The following class, "ActionModule," enables applications to register their own modules to handle commands provided by the voice assistant server:
// Applications can register their own software modules
// to handle commands provided by the voice assistant server.
class ActionModule {
 public:
  // Describes the result of an action: whether it was executed successfully.
  class Result {
   public:
    virtual ~Result() = default;
    // Sets the action result to indicate success.
    virtual void SetOk() = 0;
    // Sets the action result to the given response code and
    // human-readable string.
    virtual void SetError(int response_code, const std::string& str) = 0;
  };
  // Arguments for the action handler.
  class Args {
   public:
    virtual ~Args() = default;
    // Gets the serialized protobuf data of the action handler argument
    // of the given type.
    virtual bool GetProtobufDataFromType(std::string type,
                                         std::string* data) = 0;
  };
  virtual ~ActionModule() = default;
  // Returns the name of this module.
  virtual std::string GetName() = 0;
  // Handles the given |action_name| with its |args| and updates |result|
  // according to the outcome of executing the action.
  virtual void Handle(std::string action_name, std::unique_ptr<Args> args,
                      Result* result) = 0;
  // Sets the named protobuf to the given serialized data to indicate to the
  // voice assistant the local state of this module.
  virtual bool GetModuleContext(std::string* protobuf_type,
                                std::string* protobuf_data) = 0;
};
[0093] The following class, "BuildInfo," can be used to describe the application running the voice assistant client library 240 or the voice assistant client device 104 itself (e.g., using the application, platform, and / or device identifier or version number):
// Build information describing the application that runs the voice
// assistant client library. For a dedicated voice assistant device, this
// should describe the device.
// This object is returned from CreateDefaultBuildInfo, can be modified,
// and is then set back on the settings.
class BuildInfo {
 public:
  virtual ~BuildInfo() = default;
  // Sets the application version.
  virtual void SetApplicationVersion(
      const std::string& application_version) = 0;
  // Sets the installation identifier. This must be a device-specific
  // identifier. It should not be the same as any other device or user
  // identifier.
  virtual void SetInstallId(const std::string& install_id) = 0;
  // Sets the platform identifier.
  virtual void SetPlatformId(const std::string& platform_id) = 0;
  // Sets the platform version.
  virtual void SetPlatformVersion(const std::string& platform_version) = 0;
  // Sets the device model. Optional.
  virtual void SetDeviceModel(const std::string& device_model) = 0;
};
[0116] The following class, "EventDelegate," defines functions associated with basic events, such as the start of speech recognition, the start and completion of the voice assistant's output of a speech response, etc.
// Receives events from the assistant library.
class EventDelegate {
 public:
  class RecognizedSpeechChangedEvent {
   public:
    virtual ~RecognizedSpeechChangedEvent() {}
    // Indicates the updated recognized text from the voice assistant. If
    // delivered as part of OnRecognizingSpeechFinishedEvent, this indicates
    // the final recognized text.
    virtual std::string GetRecognizedSpeech() = 0;
  };
  virtual ~EventDelegate() {}
  // Indicates that the voice assistant client library is booting up.
  virtual void OnBootingUpEvent() = 0;
  // Indicates that the hotword was heard.
  virtual void OnHeardHotwordEvent() = 0;
  // Indicates that speech recognition has started. Speech recognition will
  // continue until OnRecognizingSpeechFinishedEvent is received.
  virtual void OnRecognizingSpeechStartedEvent() = 0;
  // Indicates that the current hypothesis of the recognized speech has
  // changed. |event| indicates the new hypothesis.
  virtual void OnRecognizedSpeechChangedEvent(
      const RecognizedSpeechChangedEvent& event) = 0;
  // Indicates that final speech recognition has occurred.
  // |event| indicates the final value.
  virtual void OnRecognizingSpeechFinishedEvent(
      const RecognizedSpeechChangedEvent& event) = 0;
  // Indicates that the voice assistant is starting to respond by voice.
  // The voice assistant will keep responding until
  // OnRespondingFinishedEvent is received.
  virtual void OnRespondingStartedEvent() = 0;
  // Indicates that the voice assistant has finished responding by voice.
  virtual void OnRespondingFinishedEvent() = 0;
  // Indicates that an alarm has started sounding. The alarm will keep
  // sounding until OnAlarmSoundingFinishedEvent is received.
  virtual void OnAlarmSoundingStartedEvent() = 0;
  // Indicates that the alarm has finished sounding.
  virtual void OnAlarmSoundingFinishedEvent() = 0;
  // Indicates that a timer has started sounding. The timer will keep
  // sounding until OnTimerSoundingFinishedEvent is received.
  virtual void OnTimerSoundingStartedEvent() = 0;
  // Indicates that the timer has finished sounding.
  virtual void OnTimerSoundingFinishedEvent() = 0;
  // Indicates that the default volume has changed (for example, when the
  // user says "turn up the volume" without specifying the alarm or another
  // specific volume type). |new_volume| indicates the new default volume,
  // from 0.0 to 1.0.
  virtual void OnDefaultVolumeChangeEvent(float new_volume) = 0;
  // Indicates that the voice assistant client library is out of date with
  // respect to the server and needs to be updated. When this happens, the
  // client no longer interacts with the server.
  virtual void OnClientLibraryOutOfDateEvent() = 0;
};
[0170] The following class, "DefaultEventDelegate," defines do-nothing overrides for all of the events:
// Provides a default no-op implementation of EventDelegate.
// Useful for overriding only the functions that are of interest.
class DefaultEventDelegate : public EventDelegate {
 public:
  void OnBootingUpEvent() override {}
  void OnHeardHotwordEvent() override {}
  void OnRecognizingSpeechStartedEvent() override {}
  void OnRecognizedSpeechChangedEvent(
      const RecognizedSpeechChangedEvent& event) override {}
  void OnRecognizingSpeechFinishedEvent(
      const RecognizedSpeechChangedEvent& event) override {}
  void OnRespondingStartedEvent() override {}
  void OnRespondingFinishedEvent() override {}
  void OnAlarmSoundingStartedEvent() override {}
  void OnAlarmSoundingFinishedEvent() override {}
  void OnTimerSoundingStartedEvent() override {}
  void OnTimerSoundingFinishedEvent() override {}
  void OnDefaultVolumeChangeEvent(float new_volume) override {}
  void OnClientLibraryOutOfDateEvent() override {}
};
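As a hedged illustration of why a do-nothing base class is convenient, the sketch below derives an observer that overrides a single event. To stay self-contained it re-declares trimmed-down copies of "EventDelegate" and "DefaultEventDelegate" with only two events; `HotwordCounter` is a hypothetical class that does not appear in the library.

```cpp
#include <cassert>

// Trimmed-down copy of the interface, limited to two events for brevity.
class EventDelegate {
 public:
  virtual ~EventDelegate() {}
  virtual void OnBootingUpEvent() = 0;
  virtual void OnHeardHotwordEvent() = 0;
};

// No-op defaults, mirroring the DefaultEventDelegate pattern.
class DefaultEventDelegate : public EventDelegate {
 public:
  void OnBootingUpEvent() override {}
  void OnHeardHotwordEvent() override {}
};

// Hypothetical observer that cares about one event only.
class HotwordCounter : public DefaultEventDelegate {
 public:
  void OnHeardHotwordEvent() override { ++count_; }
  int count() const { return count_; }

 private:
  int count_ = 0;
};
```

An application written this way does not need to change when it ignores most events: it inherits the empty overrides and implements only the callbacks it uses.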
[0190] The following class "Settings" defines the settings that can be provided to controller 402 (e.g., locale, geolocation, file system directory).
// Assistant settings that are provided to the controller. They must be
// provided to the controller when the assistant is started. They can also
// be updated and then provided to the Reconfigure function to take effect.
// Embedding applications should not create their own classes derived from
// this.
class Settings {
 public:
  virtual ~Settings() {}
  // Creates a default BuildInfo object.
  virtual std::unique_ptr<BuildInfo> CreateDefaultBuildInfo() = 0;
  // Sets the geolocation of the device. Optional.
  virtual void SetGeolocation(const Geolocation& geolocation) = 0;
  // Sets the build information of the device.
  virtual void SetBuildInfo(const BuildInfo& build_info) = 0;
  // Sets a file system directory that the voice assistant client library
  // can use. This directory should be cleared whenever the voice assistant
  // client library should lose all previous context, for example when a
  // factory data reset occurs.
  virtual void SetAssistantDirectory(const std::string& path) = 0;
  // Sets the UserAgent used when connecting to the server.
  virtual void SetUserAgent(const std::string& user_agent) = 0;
  // Sets the locale information of the device.
  virtual void SetLocaleInfo(const LocaleInfo& locale_info) = 0;
};
[0220] The class "Controller" below corresponds to controller 402, and the functions Start(), Reconfigure(), RegisterAction(), and RegisterEventObserver() correspond to functions Start() 404, Reconfigure() 408, RegisterAction() 406, and RegisterEventObserver() 410, respectively.
// The controller class for the assistant.
class Controller {
 public:
  virtual ~Controller() {}
  // Creates a new default settings object that the application should
  // configure and then pass to Start.
  virtual std::unique_ptr<Settings> CreateDefaultSettings() = 0;
  // Starts the assistant and returns immediately. Returns true on success,
  // false on failure. Will only succeed once per process. |settings| are the
  // settings for the assistant module. They are passed by const reference,
  // so it is clear that the caller retains the Settings object and that any
  // subsequent changes will have no effect unless passed to Reconfigure.
  // This function will fail if any required settings are not set.
  virtual bool Start(const Settings& settings) = 0;
  // Reconfigures the running assistant and returns immediately.
  // Returns false on failure, including when the assistant has not yet been
  // started. |settings| are the new settings for the voice assistant module.
  // This function will fail if any required settings are not set.
  virtual bool Reconfigure(const Settings& settings) = 0;
  // Registers the action |module|. Fails if it is already registered.
  virtual bool RegisterAction(std::unique_ptr<ActionModule> module) = 0;
  // Registers an EventDelegate to receive all assistant events.
  virtual void RegisterEventObserver(
      std::unique_ptr<EventDelegate> delegate) = 0;
  // Call this function to create the controller class that controls the
  // assistant. |platform_api| must be set to a pointer to the platform API
  // that the assistant will use. Returns nullptr on error.
  static ASSISTANT_EXPORT std::unique_ptr<Controller> Create(
      std::unique_ptr<PlatformApi> platform_api);
};
[0247] In some implementations, the voice assistant client device 104 or projection device 106 implements a platform (e.g., a set of interfaces for communicating with other devices using the same platform and an operating system configured to support that set of interfaces). The example code below illustrates functions associated with the interface through which the voice assistant client library 240 interacts with the platform.
[0248] The following class "Authentication" defines the authentication token used to authenticate a user of the voice assistant with a specific account:
// The authentication provider of the platform.
class Authentication {
 public:
  // Returns the authentication scopes of the authentication token.
  virtual std::string GetGoogleOAuth2Scopes() = 0;
  // Returns the authentication token through |token|.
  virtual bool GetGoogleOAuth2Token(std::string* token) = 0;

 protected:
  virtual ~Authentication() = default;
};
[0259] The following enumeration, "OutputStreamType", defines the type of the audio output stream:
// Possible audio output stream types.
enum OutputStreamType {
  kTts,
  kAlarm,
  kCalibration,
};
[0261] The following enumeration "SampleFormat" defines the supported audio sample formats (e.g., PCM formats):
// Supported PCM sample formats.
enum SampleFormat {
  kInterleavedS16,  // Interleaved signed 16-bit integer.
  kInterleavedS32,  // Interleaved signed 32-bit integer.
  kInterleavedF32,  // Interleaved 32-bit floating point.
  kPlanarS16,       // Planar signed 16-bit integer.
  kPlanarS32,       // Planar signed 32-bit integer.
  kPlanarF32,       // Planar 32-bit floating point.
};
[0270] The following structure "BufferFormat" defines the format of the data stored in the audio buffer at the device:
// Information about the format of the data stored in an audio buffer.
struct BufferFormat {
  int sample_rate;
  SampleFormat sample_format;
  int num_channels;
};
[0276] The class "AudioBuffer" defines a buffer for audio data:
// A buffer class used for input / output of audio data.
class AudioBuffer {
 public:
  // Returns the format of the data in the buffer.
  virtual BufferFormat GetFormat() const = 0;
  // Immutable data; used by the AudioInput delegate to read the input data.
  virtual const char* GetData() const = 0;
  // Writable data; used by the AudioOutput delegate to write more output
  // data.
  virtual char* GetWritableData() const = 0;
  // Returns the number of audio frames contained in
  // GetData() / GetWritableData().
  virtual int GetFrames() const = 0;

 protected:
  virtual ~AudioBuffer() {}
};
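Because "GetFrames()" reports a frame count rather than a byte count, a caller typically needs the byte size of one frame. The sketch below shows one way to derive it from "BufferFormat", assuming the conventional PCM definition of a frame as one sample for every channel. The helper names `BytesPerSample`, `BytesPerFrame`, and `BufferSizeBytes` are hypothetical and are not part of the library; the enum and struct are copied from the listings above so that the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>

enum SampleFormat {
  kInterleavedS16, kInterleavedS32, kInterleavedF32,
  kPlanarS16, kPlanarS32, kPlanarF32,
};

struct BufferFormat {
  int sample_rate;
  SampleFormat sample_format;
  int num_channels;
};

// Size in bytes of a single sample of the given format.
std::size_t BytesPerSample(SampleFormat format) {
  switch (format) {
    case kInterleavedS16:
    case kPlanarS16:
      return 2;
    case kInterleavedS32:
    case kPlanarS32:
    case kInterleavedF32:
    case kPlanarF32:
      return 4;
  }
  return 0;  // Unknown format.
}

// Size in bytes of one frame (one sample for every channel).
std::size_t BytesPerFrame(const BufferFormat& format) {
  return BytesPerSample(format.sample_format) *
         static_cast<std::size_t>(format.num_channels);
}

// Total payload size in bytes of a buffer holding |frames| frames.
std::size_t BufferSizeBytes(const BufferFormat& format, int frames) {
  return BytesPerFrame(format) * static_cast<std::size_t>(frames);
}
```

The frame size is the same for interleaved and planar layouts; only the ordering of samples in memory differs.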
[0293] The following class "AudioOutput" defines the interface for audio output:
// Interface used for audio output.
class AudioOutput {
 public:
  enum Error {
    kFatalError,
    kUnderrun,
  };
  class Delegate {
   public:
    // Called when more audio data is needed. The delegate implementation
    // must fill |buffer| with data as soon as possible and call |done_cb|
    // once the data has been written. Note that the delegate may partially
    // fill the buffer, but the number of |bytes_written| must be a multiple
    // of the frame size. The delegate does not take ownership of |buffer|.
    // Note that this method must not block. If data is not immediately
    // available to fill the buffer, the buffer may be filled asynchronously
    // from any thread, and then |done_cb| must be called.
    // |done_cb| must not be called after the stream has been stopped by a
    // call to Stop().
    // If the end of the stream has been reached, the delegate must call
    // |done_cb| with 0 |bytes_written|.
    virtual void FillBuffer(
        AudioBuffer* buffer,
        const std::function<void(int frames_written)>& done_cb) = 0;
    // Called to indicate that the end of the stream (i.e., the point where
    // the delegate passed 0 |bytes_written| to the |done_cb| of
    // FillBuffer()) has been played out. Once this has been called, it is
    // safe to call Stop() without any risk of discarding unplayed data.
    virtual void OnEndOfStream() = 0;
    // Called when an output error occurs.
    virtual void OnError(Error error) = 0;
    // Called once the output has stopped. Once this method has been called,
    // there will be no further calls to any delegate methods unless the
    // output is started again.
    virtual void OnStopped() = 0;

   protected:
    ~Delegate() {}
  };
  virtual ~AudioOutput() {}
  // Returns the stream type of this output, which was specified when the
  // output was created.
  virtual OutputStreamType GetType() = 0;
  // Starts audio output. This will begin requesting buffers in the given
  // |format| by calling the delegate's FillBuffer() method.
  virtual void Start(const BufferFormat& format, Delegate* delegate) = 0;
  // Stops audio output, putting this interface in a state where Start() can
  // safely be called again with a new audio format and delegate.
  // Any unplayed data provided by the delegate should be discarded when
  // Stop() is called.
  // Once the stop is complete and no further calls to the delegate will be
  // made, the delegate's OnStopped() method will be called.
  virtual void Stop() = 0;
  // Sets the volume range for this output stream. The volume of this stream
  // should follow the default volume as long as that volume is within the
  // range |min_volume| <= volume <= |max_volume| (that is, use the default
  // volume, but clamped to the given range). |min_volume| and |max_volume|
  // are values 0.0 <= v <= 1.0 and represent a fraction of the total
  // possible output volume of the system.
  virtual void SetVolume(float min_volume, float max_volume) = 0;
};
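The FillBuffer() comments describe a pull model: the output asks for data, the delegate may fill the buffer only partially, and a count of zero signals the end of the stream. The sketch below illustrates that sequence under simplifying assumptions. `SketchBuffer` is a hypothetical stand-in for "AudioBuffer" that holds mono 16-bit frames, and `PcmSource` is a hypothetical delegate-like class that completes synchronously; neither is part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for AudioBuffer: mono, signed 16-bit, one sample
// per frame. The vector's size is the buffer capacity in frames.
struct SketchBuffer {
  std::vector<int16_t> frames;
};

// Hypothetical source that hands out a fixed PCM clip buffer by buffer.
class PcmSource {
 public:
  explicit PcmSource(std::vector<int16_t> pcm) : pcm_(std::move(pcm)) {}

  // Copies as many frames as fit, then reports the count through |done_cb|.
  // Reports 0 once the clip is exhausted, which marks the end of the stream.
  void FillBuffer(SketchBuffer* buffer,
                  const std::function<void(int frames_written)>& done_cb) {
    const std::size_t remaining = pcm_.size() - next_;
    const std::size_t n = std::min(remaining, buffer->frames.size());
    std::copy_n(pcm_.begin() + static_cast<std::ptrdiff_t>(next_), n,
                buffer->frames.begin());
    next_ += n;
    done_cb(static_cast<int>(n));
  }

 private:
  std::vector<int16_t> pcm_;
  std::size_t next_ = 0;
};
```

For a five-frame clip and a four-frame buffer, three successive calls report 4, 1, and 0 frames. A real delegate could instead return immediately and invoke the callback later from another thread, as the comments allow.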
[0331] The following class "AudioInput" defines the interface for capturing audio input:
// Interface used to capture audio input. When started, this should capture
// audio from all microphones and provide the data from each microphone as
// its own channel in the buffers provided to the delegate's
// OnBufferAvailable() method.
class AudioInput {
 public:
  enum Error {
    kFatalError,
    kOverrun,
  };
  class Delegate {
   public:
    // Called when more input audio data is available. |timestamp| is the
    // time in microseconds (relative to the CLOCK_MONOTONIC_RAW epoch) at
    // which the data in |buffer| was captured (for loopback audio, it is the
    // estimated timestamp at which the data will be played).
    virtual void OnBufferAvailable(const AudioBuffer& buffer,
                                   int64_t timestamp) = 0;
    // Called when an error occurs in AudioInput.
    virtual void OnError(Error error) = 0;
    // Called once the input has stopped. Once this method has been called,
    // there will be no further calls to any delegate methods unless the
    // input is started again.
    virtual void OnStopped() = 0;
  };
  virtual ~AudioInput() {}
  // Starts capturing audio input and passing it to the delegate's
  // OnBufferAvailable() method.
  virtual void Start(Delegate* delegate) = 0;
  // Stops capturing audio input. Once the input has stopped and no further
  // calls to any delegate methods will be made, the delegate's OnStopped()
  // method will be called.
  virtual void Stop() = 0;
};
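Since each microphone is delivered as its own channel, a consumer of an interleaved buffer has to pick out every N-th sample to recover one microphone's signal. The following sketch shows that step for interleaved signed 16-bit data. `ExtractChannel` is a hypothetical helper, not a library function, and it assumes the frame-major layout conventionally meant by "interleaved" (all channels of frame 0, then all channels of frame 1, and so on).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the samples of one channel from frame-major interleaved data.
// Returns an empty vector for an invalid channel index or channel count.
// A trailing partial frame, if any, is ignored.
std::vector<int16_t> ExtractChannel(const std::vector<int16_t>& interleaved,
                                    int num_channels, int channel) {
  std::vector<int16_t> out;
  if (num_channels <= 0 || channel < 0 || channel >= num_channels) {
    return out;
  }
  const std::size_t stride = static_cast<std::size_t>(num_channels);
  const std::size_t frames = interleaved.size() / stride;
  out.reserve(frames);
  for (std::size_t frame = 0; frame < frames; ++frame) {
    out.push_back(interleaved[frame * stride +
                              static_cast<std::size_t>(channel)]);
  }
  return out;
}
```

For planar formats no such step is needed, because each channel already occupies a contiguous region of the buffer.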
[0353] The following class "Resources" defines access to system resources:
// Access to system resource files.
class Resources {
 public:
  using ResourceLoadingCallback =
      std::function<void(const std::string& output)>;
  Resources() {}
  virtual ~Resources() {}
  virtual bool GetBuiltinHotwordData(
      const LocaleInfo* locale, const ResourceLoadingCallback& callback) = 0;
  virtual bool GetAlarmMp3(const ResourceLoadingCallback& callback) = 0;
  virtual bool GetTimerMp3(const ResourceLoadingCallback& callback) = 0;
  virtual bool GetCalibrationMp3(const ResourceLoadingCallback& callback) = 0;
  virtual bool GetVolumeChangeMp3(
      const ResourceLoadingCallback& callback) = 0;
  virtual bool GetSpeechRecognitionErrorMp3(
      const LocaleInfo* locale, const ResourceLoadingCallback& callback) = 0;
  virtual bool GetSpeechRecognitionStoppedMp3(
      const LocaleInfo* locale, const ResourceLoadingCallback& callback) = 0;
  virtual bool GetNoInternetMp3(
      const LocaleInfo* locale, const ResourceLoadingCallback& callback) = 0;
};
[0376] The following class "PlatformApi" specifies the platform API (e.g., platform API 244) for the voice assistant client library 240:
// Platform API to be used by the voice assistant.
class PlatformApi {
 public:
  virtual ~PlatformApi() {}
  // Returns the audio output interface for the required stream |type|.
  // This is owned by the PlatformApi.
  virtual std::unique_ptr<AudioOutput> GetAudioOutput(
      OutputStreamType type) = 0;
  // Returns the interface used to capture audio input.
  virtual std::unique_ptr<AudioInput> GetAudioInput() = 0;
  // Returns the interface used to capture loopback audio. This is an
  // "audio input" whose captured data is the audio data that is about to be
  // played. Loopback audio may be captured after all mixing and
  // post-processing are complete, as close as possible to the point where
  // the data is sent to the output hardware.
  virtual std::unique_ptr<AudioInput> GetLoopbackInput() = 0;
  virtual Authentication& GetAuthentication() = 0;
};
[0394] In some implementations, volume control can be handled outside of the voice assistant client library 240. For example, the system volume can be maintained by the device outside of the control of the voice assistant client library 240. As another example, the voice assistant client library 240 can still support volume control, but volume control requests to the voice assistant client library 240 are directed to the device.
[0395] In some implementations, the alarm and timer functions in the voice assistant client library 240 can be disabled by the user or when the library is implemented at the device.
[0396] In some implementations, the voice assistant client library 240 also supports an interface to LEDs on the device to facilitate the display of LED animations on the device LEDs.
[0397] In some implementations, the voice assistant client library 240 may be included in or linked to a projection receiver module (e.g., receiver module 146) at the projection device 106. The link between the voice assistant client library 240 and the receiver module 146 may include, for example, support for additional actions (e.g., local media playback) and support for control of LEDs on the projection device 106.
[0398] Figure 5 illustrates a flowchart of a method 500 for processing verbal input on a device according to some embodiments. Method 500 is executed at an electronic device (e.g., voice assistant client device 104, projection device 106) having an audio input system (e.g., audio input device 108 / 132), one or more processors (e.g., processing unit 202), and a memory (e.g., memory 206) storing one or more programs executed by the one or more processors. In some embodiments, the electronic device includes an audio input system (e.g., audio input device 108 / 132), one or more processors (e.g., processing unit 202), and a memory (e.g., memory 206) storing one or more programs executed by the one or more processors, the one or more programs including instructions for performing method 500. In some embodiments, a non-transitory computer-readable storage medium stores one or more programs, the one or more programs including instructions that, when executed by an electronic device having an audio input system (e.g., audio input device 108 / 132) and one or more processors (e.g., processing unit 202), cause the electronic device to perform method 500. The programs or instructions for performing method 500 may be included in the modules, libraries, etc. described above with reference to Figures 2-4.
[0399] The device receives (502) verbal input at the device. The client device 104 / projection device 106 captures verbal input (e.g., voice input) issued by the user.
[0400] The device processes (504) the verbal input. The client device 104 / projection device 106 processes the verbal input. This processing may include hot word detection, conversion to text data, and identification of words and phrases corresponding to user-provided commands, requests, and / or parameters. In some implementations, the processing may be minimal or nonexistent. For example, processing may include encoding the audio of the verbal input for transmission to server 114, or preparing the raw audio of the captured verbal input for transmission to server 114.
[0401] The device transmits (506) a request to a remote system, the request including information determined based on the verbal input. The client device 104 / projection device 106 determines the request from the verbal input by processing it to identify the request and one or more associated parameters. The client device 104 / projection device 106 transmits the determined request to the remote system (e.g., server 114), where the remote system determines and generates a response to the request. In some embodiments, the client device 104 / projection device 106 transmits the verbal input (e.g., as encoded audio, as raw audio data) to server 114, and server 114 processes the verbal input to determine the request and associated parameters.
[0402] The device receives (508) a response to the request, wherein the response is generated by the remote system based on the information determined from the verbal input. The remote system (e.g., server 114) determines and generates a response to the request and transmits the response to the client device 104 / projection device 106.
[0403] The device performs (510) an operation based on the response. The client device 104 / projection device 106 performs one or more operations based on the received response. For example, if the response is a command for the device to output some information via audio, the client device 104 / projection device 106 retrieves the information, converts the information into voice audio output, and outputs the voice audio through a speaker. As another example, if the response is a command for the device to play media content, the client device 104 / projection device 106 retrieves the media content and plays the media content.
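Steps 502 to 510 can be illustrated with a minimal sketch. Everything named here (`Request`, `Response`, `FakeRemoteSystem`, `process_verbal_input`, `run_method_500`) is hypothetical and is not taken from the voice assistant client library 240. The remote system is replaced by an in-process stub, and the verbal input is represented as already-transcribed text rather than audio.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Information determined from the verbal input (step 504)."""
    command: str
    parameters: dict = field(default_factory=dict)


@dataclass
class Response:
    """Response generated by the remote system (step 508)."""
    action: str
    payload: str


class FakeRemoteSystem:
    """Stand-in for server 114; a real system would be reached over a network."""

    def handle(self, request: Request) -> Response:
        if request.command == "play":
            return Response("play_media", request.parameters.get("title", ""))
        return Response("speak", f"Sorry, I cannot handle '{request.command}'.")


def process_verbal_input(text: str) -> Request:
    """Step 504: identify the command and its parameters (assumed text input)."""
    words = text.lower().split()
    if words and words[0] == "play":
        return Request("play", {"title": " ".join(words[1:])})
    return Request(words[0] if words else "", {})


def run_method_500(verbal_input: str, remote: FakeRemoteSystem) -> str:
    """Steps 502-510 in order; returns a description of the operation performed."""
    request = process_verbal_input(verbal_input)  # 502 (receive) and 504 (process)
    response = remote.handle(request)             # 506 (transmit) and 508 (receive)
    if response.action == "play_media":           # 510 (perform operation)
        return f"playing: {response.payload}"
    return f"speaking: {response.payload}"


if __name__ == "__main__":
    print(run_method_500("Play morning news", FakeRemoteSystem()))
```

In this sketch the device does the request extraction itself. The alternative described in [0401], in which the device forwards encoded or raw audio and the server determines the request, would move `process_verbal_input` behind `FakeRemoteSystem.handle`.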
[0404] One or more of the receiving, processing, transmitting, receiving, and performing steps are performed by one or more voice processing modules of a voice assistant library running on the electronic device. These voice processing modules provide (512) multiple voice processing operations accessible to one or more applications and / or operating software that are executed or executable on the electronic device. The client device 104 / projection device 106 may have a voice assistant client library 240, which includes functions and modules for performing one or more of the receiving, processing, transmitting, receiving, and performing steps. The modules of the voice assistant client library 240 provide multiple voice processing and assistant operations accessible to applications, operating systems, and platform software (e.g., runtime library 240 and associated APIs) at the client device 104 / projection device 106 that includes or is linked to the library 240.
[0405] In some implementations, at least some voice processing operations associated with the voice processing module are performed on a remote system interconnected with an electronic device via a wide area network. For example, processing to determine the verbal input of a request may be performed by a server 114, which is connected to the client device 104 / projection device 106 via a network 112.
[0406] In some implementations, the voice assistant library is executable on a common operating system that is operable on multiple different device types, thereby achieving portability of voice-enabled applications configured to interact with one or more voice processing operations. The voice assistant client library 240 (and related libraries and APIs, such as debug library 242, platform API 244, POSIX API 246) uses standard components (e.g., objects) of a predefined operating system (e.g., Linux), and is therefore operable on various devices running distributions or variants of the predefined operating system (e.g., different Linux distributions or variants). In this way, voice assistant functionality is available on various devices, and the voice assistant experience is consistent across devices.
[0407] In some implementations, requests and responses can be processed at the device itself. For example, for basic device-local functions such as timers, alarms, clocks, and volume control, client device 104 / projection device 106 can process verbal input and determine that the request corresponds to one of these basic functions, determine the response at the device, and perform one or more operations based on the response. For logging purposes, the device can still report requests and responses to server 114.
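The local handling described above can be sketched as a simple routing decision. This is an illustration under stated assumptions only: `LOCAL_FUNCTIONS`, `LoggingServerStub`, and `handle_request` are hypothetical names, and the server is an in-process stub standing in for server 114.

```python
# Basic device-local functions named in the text; the set itself is an assumption.
LOCAL_FUNCTIONS = {"timer", "alarm", "clock", "volume"}


class LoggingServerStub:
    """Stand-in for server 114: records reports and answers non-local requests."""

    def __init__(self):
        self.log = []

    def report(self, request, response):
        # Logging only; the response has already been determined on the device.
        self.log.append((request, response))

    def handle(self, request):
        return f"remote response to '{request}'"


def handle_request(function: str, argument: str, server: LoggingServerStub) -> str:
    """Resolve basic functions on the device; send everything else to the server."""
    request = f"{function} {argument}".strip()
    if function in LOCAL_FUNCTIONS:
        response = f"local {function} set to {argument}"
        server.report(request, response)  # still reported for logging purposes
        return response
    return server.handle(request)


if __name__ == "__main__":
    server = LoggingServerStub()
    print(handle_request("volume", "7", server))
    print(handle_request("weather", "today", server))
```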
[0408] In some implementations, a device-agnostic voice assistant library for electronic devices including audio input systems includes one or more voice processing modules configured to execute on a common operating system implemented on multiple different types of electronic devices. The voice processing modules provide multiple voice processing operations accessible to applications and operating software executable on the electronic device, thereby enabling the portability of voice-enabled applications configured to interact with one or more voice processing operations. Voice assistant client library 240 is a library that can run on various devices sharing the same predefined operating system foundation (e.g., the library and device operating systems are Linux-based), and therefore the library is device-agnostic. Library 240 provides multiple modules for application-accessible voice assistant functionality on various devices.
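One way to picture device-agnosticism is a library that touches the device only through a narrow platform interface. The sketch below is an assumption made for illustration, not the actual platform API 244: `PlatformApi`, `VoiceLibrary`, `SpeakerDevice`, and their methods are hypothetical names.

```python
from abc import ABC, abstractmethod


class PlatformApi(ABC):
    """Hypothetical device-specific surface that the library depends on."""

    @abstractmethod
    def read_audio(self) -> bytes:
        """Return one chunk of captured audio."""

    @abstractmethod
    def play_audio(self, data: bytes) -> None:
        """Send one chunk of audio to the device's output."""


class VoiceLibrary:
    """Device-agnostic code: it only talks to the PlatformApi interface."""

    def __init__(self, platform: PlatformApi):
        self._platform = platform

    def echo_once(self) -> int:
        """Read one chunk of audio and play it back; returns the chunk size."""
        chunk = self._platform.read_audio()
        self._platform.play_audio(chunk)
        return len(chunk)


class SpeakerDevice(PlatformApi):
    """One device type; another device would supply its own implementation."""

    def __init__(self):
        self.played = []

    def read_audio(self) -> bytes:
        return b"\x00\x01" * 4  # fixed dummy samples in place of a microphone

    def play_audio(self, data: bytes) -> None:
        self.played.append(data)


if __name__ == "__main__":
    device = SpeakerDevice()
    print(VoiceLibrary(device).echo_once())
```

Under this arrangement, `VoiceLibrary` would be unchanged across device types, and only the `PlatformApi` implementation would differ per device.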
[0409] In some implementations, at least some voice processing operations associated with the voice processing module are performed on a back-end server interconnected with an electronic device via a wide area network. For example, library 240 includes a module that communicates with server 114 to transmit verbal input to server 114 for processing to determine a request.
[0410] In some implementations, voice processing operations include device-specific operations configured to control devices coupled to electronic devices (e.g., directly or communicatively). Library 240 may include functions or modules for controlling other devices coupled to client device 104 / projection device 106 (e.g., wireless speakers, smart TVs, etc.).
[0411] In some implementations, voice processing operations include information and media request operations configured to provide requested information and / or media content to a user of the electronic device or a user on a device coupled to the electronic device (e.g., directly or communicatively). Library 240 may include functions or modules for retrieving information or media and providing it on the client device 104 / projection device 106 or on the coupled device (e.g., reading emails aloud, reading news articles aloud, playing streaming music).
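As a rough illustration of how device-specific and media operations might be exposed to applications, the sketch below models a controller that starts the assistant, registers named actions, and accepts updated settings. The `Controller` class and all of its method names are hypothetical and do not reproduce the interface of library 240.

```python
from typing import Callable, Dict


class Controller:
    """Hypothetical controller: start the assistant, register actions, update settings."""

    def __init__(self, settings: Dict[str, str]):
        self.settings = dict(settings)
        self.running = False
        self._actions: Dict[str, Callable[[str], str]] = {}

    def start(self) -> None:
        self.running = True

    def register_action(self, name: str, handler: Callable[[str], str]) -> None:
        """Associate an action name with a device-specific handler."""
        self._actions[name] = handler

    def reconfigure(self, updated: Dict[str, str]) -> None:
        """Merge updated settings into the current settings."""
        self.settings.update(updated)

    def dispatch(self, name: str, argument: str) -> str:
        """Run the handler registered for an action, if any."""
        if not self.running:
            raise RuntimeError("assistant not started")
        handler = self._actions.get(name)
        if handler is None:
            return f"no action registered for '{name}'"
        return handler(argument)


if __name__ == "__main__":
    controller = Controller({"locale": "en-US"})
    controller.start()
    # Device-specific operation: control a coupled smart TV (simulated).
    controller.register_action("play_on_tv", lambda title: f"smart TV playing {title}")
    print(controller.dispatch("play_on_tv", "news"))
```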
[0412] It should be understood that while the terms "first," "second," etc., may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first contact may be referred to as a second contact, and similarly, a second contact may be referred to as a first contact, without changing the meaning of the description, provided that all instances of "first contact" are consistently renamed and all instances of "second contact" are consistently renamed. First contact and second contact are both contacts, but they are not the same contact.
[0413] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the claims. As used in the description of embodiments and appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and / or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. It will be further understood that, when used in this specification, the terms "comprising" and / or "including" specify the presence of the stated features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or combinations thereof.
[0414] As used herein, the term "if" can be interpreted, depending on the context, as meaning "when" or "in response to determining" or "in accordance with a determination" or "in response to detecting" that the stated precondition is true. Similarly, the phrase "if it is determined [that the stated precondition is true]" or "if [the stated precondition is true]" or "when [the stated precondition is true]" can be interpreted, depending on the context, as meaning "in response to determining" or "in accordance with a determination" or "upon detecting" or "in response to detecting" that the stated precondition is true.
[0415] Various embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Numerous specific details are set forth in the following detailed description to provide a thorough understanding of the invention and the described embodiments. However, the invention may be practiced without these specific details. In other instances, well-known methods, processes, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
[0416] For purposes of explanation, the foregoing description has been given with reference to specific embodiments. However, the illustrative discussion above is not intended to be exhaustive or to limit the invention to the exact forms disclosed. In view of the foregoing teachings, many modifications and variations are possible. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best utilize the invention and its various embodiments with various modifications suitable for the intended particular use.
Claims
1. A method comprising: at an electronic device that includes an audio input system, one or more processors, and memory storing one or more programs to be executed by the one or more processors: downloading a device-agnostic voice assistant library, wherein the voice assistant library includes a controller interface configured to: initiate a voice assistant at the electronic device, register an action using the voice assistant, reconfigure the voice assistant with updated settings, and register a set of functions for basic events using the voice assistant; receiving verbal input from a user via a microphone of the audio input system; extracting request information from the verbal input by processing the verbal input using the device-agnostic voice assistant library executed on the electronic device; sending a request to a remote system, the request including the extracted request information; receiving a response to the request, wherein the response is generated by the remote system based on the extracted request information; and performing an operation based on the response.
2. The method of claim 1, wherein at least some of the voice processing operations associated with the device-agnostic voice assistant library are performed on the remote system interconnected with the electronic device via a wide area network.
3. The method of claim 1, wherein processing the verbal input includes performing voice processing on the verbal input, and the voice processing is performed by the device-agnostic voice assistant library.
4. The method of claim 1, wherein processing the verbal input includes performing audio input processing on audio data of the verbal input, and the audio input processing is performed by the device-agnostic voice assistant library.
5. The method of claim 1, wherein performing the operation based on the response includes decoding audio, and the audio decoding is performed by the device-agnostic voice assistant library.
6. The method of claim 1, wherein performing the operation includes outputting an audible response to the user via the audio input system.
7. The method of claim 1, wherein configuring the device-agnostic voice assistant library includes implementing voice assistant functionality on the electronic device.
8. The method of claim 1, further comprising: downloading a debug library, the debug library being configured to provide modules and functions for debugging.
9. A system comprising: a storage device; and a processor communicatively coupled to the storage device, wherein the processor executes application code instructions stored in the storage device to cause the system to perform the method according to any one of claims 1 to 8.
10. A non-transitory computer-readable storage medium storing instructions that, when executed by a data processing apparatus, cause the data processing apparatus to perform the method according to any one of claims 1 to 8.