Estimating keyword length refinement based on speech rate classification
A multi-level keyword detection system combined with a speech rate classification engine addresses the high power consumption of always-on keyword detection, resulting in more efficient keyword detection and extended device uptime.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QUALCOMM INC
- Filing Date
- 2024-11-19
- Publication Date
- 2026-06-19
AI Technical Summary
The always-on keyword detection function in existing electronic devices leads to high power consumption, especially for battery-powered devices and IoT devices, which shortens battery life and consumes system resources.
A multi-level keyword detection system is adopted, including a first-level keyword detection system and a second-level keyword detection system. Combined with a speech rate classification engine, the keyword index is refined through a speech rate classification machine learning network to generate more accurate keyword start and end indexes, thereby reducing latency and power consumption.
The approach improves keyword detection accuracy, reduces system latency, lowers power consumption, and extends device uptime.
Figure CN122249853A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates generally to speech recognition. In some specific embodiments, examples are described for performing multi-level speech recognition based on refining the estimated keyword index using speech rate information.
Background Technology
[0002] Electronic devices such as smartphones, tablets, wearable devices, and smart TVs are becoming increasingly popular among consumers. These devices provide voice and / or data communication capabilities via wireless or wired networks. Furthermore, such electronic devices may include additional features that provide a variety of functions designed to enhance user convenience. Electronic devices may include speech recognition capabilities for receiving voice commands from the user. When a voice command from the user is received and recognized, this capability allows the electronic device to perform functions associated with the voice command (e.g., via keywords). For example, the electronic device may activate a voice-assisted application, play an audio file, or take a picture in response to a voice command from the user.
[0003] Speech recognition can be implemented as an "always-on" feature in electronic devices to maximize its utility. These always-on features require continuously running software and / or hardware resources, resulting in persistent power consumption. Mobile electronic devices, Internet of Things (IoT) devices, and the like are particularly sensitive to this always-on power demand because it can shorten battery life and consume other limited system resources, such as processing power.
Summary of the Invention
[0004] The following is a simplified summary of the invention relating to one or more aspects disclosed herein. Therefore, this summary should not be considered an exhaustive overview relating to all conceived aspects, nor should it be considered to identify key or decisive elements relating to all conceived aspects or to depict the scope associated with any particular aspect. Thus, the sole purpose of this summary is to present, in a simplified form, certain concepts relating to one or more aspects involving the mechanisms disclosed herein, prior to the detailed description presented below.
[0005] Systems, methods, apparatuses, and computer-readable media for processing one or more audio samples are disclosed. According to at least one exemplary example, a method for processing one or more audio samples is provided. The method may include: detecting spoken keywords within audio samples in the one or more audio samples using a first keyword detection model; determining an estimated keyword index corresponding to the detection of spoken keywords within the audio samples, the estimated keyword index including an estimated keyword start index and an estimated keyword end index; determining speech rate information corresponding to the audio samples using a speech rate classification machine learning network; obtaining an average spoken length value corresponding to the spoken keywords and the speech rate information; and generating a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword start index offset to a time earlier than the estimated keyword start index and a refined keyword end index offset to a time later than the estimated keyword end index.
[0006] In another exemplary example, an apparatus for processing one or more audio samples is provided. The apparatus includes: one or more memories; and one or more processors coupled to the one or more memories. The one or more processors are configured and capable of performing the following operations: detecting spoken keywords within audio samples in the one or more audio samples using a first keyword detection model; determining an estimated keyword index corresponding to the detection of spoken keywords within the audio samples, the estimated keyword index including an estimated keyword start index and an estimated keyword end index; determining speech rate information corresponding to the audio samples using a speech rate classification machine learning network; obtaining an average spoken length value corresponding to the spoken keywords and the speech rate information; and generating a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword start index offset to a time earlier than the estimated keyword start index and a refined keyword end index offset to a time later than the estimated keyword end index.
[0007] In another exemplary example, a non-transitory computer-readable storage medium includes instructions stored thereon that, when executed by at least one processor, cause the at least one processor to: detect spoken keywords within audio samples in one or more audio samples using a first keyword detection model; determine an estimated keyword index corresponding to the detection of spoken keywords within the audio samples, the estimated keyword index including an estimated keyword start index and an estimated keyword end index; determine speech rate information corresponding to the audio samples using a speech rate classification machine learning network; obtain an average spoken length value corresponding to the spoken keywords and speech rate information; and generate a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword start index offset to a time earlier than the estimated keyword start index and a refined keyword end index offset to a time later than the estimated keyword end index.
[0008] In another exemplary example, an apparatus is provided. The apparatus includes: means for detecting spoken keywords within audio samples in one or more audio samples using a first keyword detection model; means for determining an estimated keyword index corresponding to the detection of spoken keywords within the audio samples, the estimated keyword index including an estimated keyword start index and an estimated keyword end index; means for determining speech rate information corresponding to the audio samples using a speech rate classification machine learning network; means for obtaining an average spoken length value corresponding to the spoken keywords and the speech rate information; and means for generating a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword start index offset to a time earlier than the estimated keyword start index and a refined keyword end index offset to a time later than the estimated keyword end index.
[0009] The aspects generally include methods, apparatus, systems, computer program products, non-transitory computer-readable media, user equipment, base stations, wireless communication devices and / or processing systems, as fully described herein with reference to the accompanying drawings and description, and as illustrated in the accompanying drawings and description.
[0010] The features and technical advantages of the examples according to this disclosure have been summarized rather extensively above in order to better understand the detailed description below. Additional features and advantages will be described below. The disclosed concepts and specific examples can be readily utilized as the basis for modifying or designing other structures for achieving the same purpose of this disclosure. Such equivalent constructions do not depart from the scope of the appended claims. The characteristics of the concepts disclosed herein, in both their organization and manner of operation, and the associated advantages, will be better understood by considering the following description in conjunction with the accompanying drawings. Each drawing provided in the drawings is for illustrative and descriptive purposes and not as a limitation of the definitions in the claims.
[0011] While aspects are described herein by way of example, those skilled in the art will understand that such aspects can be implemented in many different arrangements and scenarios. The techniques described herein can be implemented using different platform types, devices, systems, shapes, sizes, and / or package arrangements. For example, some aspects may be implemented via integrated chip implementations or other devices based on non-modular components (e.g., end-user equipment, vehicles, communication equipment, computing devices, industrial equipment, retail / shopping devices, medical devices, and / or artificial intelligence devices). Aspects can be implemented in chip-level components, modular components, non-modular components, non-chip-level components, device-level components, and / or system-level components. Devices incorporating the described aspects and features may include additional components and features for implementing and practicing the claimed and described aspects. For example, the transmission and reception of wireless signals may include one or more components for analog and digital purposes (e.g., hardware components including antennas, radio frequency (RF) chains, power amplifiers, modulators, buffers, processors, interleavers, adders, and / or summers). The aspects described herein are intended to be practiced in a wide variety of devices, components, systems, distributed arrangements, and / or end-user equipment of various sizes, shapes, and configurations.
[0012] Based on the accompanying drawings and detailed description, other objects and advantages associated with the aspects disclosed herein will be apparent to those skilled in the art. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to define the scope of the claimed subject matter. This subject matter should be understood with reference to the appropriate portions of the entire specification, any or all of the drawings, and each claim.
[0013] The foregoing and other features and aspects will become more apparent from the following description, claims and accompanying drawings.
Attached Figure Description
[0014] The accompanying drawings are provided to help describe various aspects of this disclosure, and are provided for illustrative purposes only and not to limit the aspects.
[0015] Figure 1 is a block diagram illustrating an example speech recognition system, according to some examples;
[0016] Figure 2 is a block diagram illustrating an example keyword detection system, according to some examples;
[0017] Figure 3 is a block diagram illustrating an example feature generator, according to some examples;
[0018] Figures 4A to 4C are diagrams illustrating examples of neural networks, according to some examples;
[0019] Figure 5 is a block diagram illustrating an example of a deep convolutional network (DCN), according to some examples;
[0020] Figure 6 is a diagram illustrating an example of the estimated start and end indices of a keyword detected in an audio sample, together with the ground truth start and end indices of that keyword in the audio sample, according to some examples;
[0021] Figure 7 is a diagram illustrating example audio samples associated with a keyword spoken at fast, normal, and slow speeds, according to some examples;
[0022] Figure 8 is a block diagram illustrating an example keyword detection system that uses keyword length refinement based on speech rate classification information, according to some examples;
[0023] Figure 9 is a block diagram illustrating an example of a speech rate classification network and a keyword index refinement engine that can be used to implement the keyword detection system of Figure 8, according to some examples;
[0024] Figure 10 is a flowchart illustrating an example of a process for processing one or more audio samples, according to some examples; and
[0025] Figure 11 is a block diagram illustrating an example of a computing system that can be used to implement some of the aspects described herein.
Detailed Implementation
[0026] Certain aspects of this disclosure are provided below for illustrative purposes. Alternative aspects may be devised without departing from the scope of this disclosure. Additionally, well-known elements of this disclosure will not be described in detail or will be omitted so as not to obscure the relevant details of this disclosure. Some of the aspects described herein can be applied independently, and some of them can be combined, as will be apparent to those skilled in the art. In the following description, specific details are set forth for illustrative purposes to provide a thorough understanding of various aspects of this application. However, it will be apparent that various aspects can be practiced without these specific details. The figures and descriptions are not intended to be limiting.
[0027] The following description provides only exemplary aspects and is not intended to limit the scope, applicability, or configuration of this disclosure. Rather, the following description of the exemplary aspects will provide those skilled in the art with a description that can be used to implement the exemplary aspects. It should be understood that various changes can be made to the function and arrangement of the elements without departing from the scope of this application as set forth in the appended claims.
[0028] Voice recognition generally refers to the identification of human voices by electronic devices in order to perform a certain function. One type of voice recognition is keyword detection (e.g., wake word detection). Keyword detection is a technique where a device detects specific words and responds accordingly. For example, many consumer electronics can utilize keyword detection to identify specific keywords to perform certain actions, such as "wake up" the device, query information, and / or enable the device to perform various other functions. Voice recognition can also be used for more complex functions, such as far-field voice recognition (e.g., from mobile devices placed across a room), user identification verification (e.g., via voice signature), voice recognition during other audio output (e.g., detecting voice commands while music is playing on the device or detecting interrupt commands while a smart assistant is speaking), and voice interaction in complex noisy environments (such as inside a moving vehicle). These are just a few examples, and many other examples are likely to exist.
[0029] Like many other processing tasks on electronic devices, voice recognition requires power and dedicated hardware and / or software to operate. Furthermore, voice recognition can be implemented as an "always-on" function (e.g., continuously monitoring audio for keyword detection) to maximize its utility for users of voice recognition-enabled electronic devices. For plugged-in devices, the power usage of always-on voice recognition is primarily a matter of efficiency optimization; however, for power-sensitive devices equipped with this function (e.g., battery-powered devices, mobile electronic devices, IoT devices, etc.), power usage is a more critical concern. For example, the power usage of always-on functionality may limit the uptime of such devices and reduce the capacity available for other system processing.
[0030] Voice recognition may include voice activity detection. For example, voice activity detection may refer to the detection of human voice by a computing device to perform a function. For example, keyword detection (e.g., also known as keyword recognition and / or keyword spotting (KWS)) is the task of detecting one or more keywords in an audio signal (e.g., an audio signal containing human speech or spoken words). For example, keyword detection can be used to distinguish activation phrases or specific commands from other speech and noise in an audio signal. In some cases, keyword detection systems may be targeted at or utilized by edge devices such as mobile phones and smart speakers. Detected keywords may include single words, compound words, phrases containing multiple words, etc. In some cases, keyword detection may be performed based on a pre-determined set of keywords and / or a user-defined set of keywords. In some cases, user-defined keywords may include one or more adaptations, adjustments, etc., determined based on specific characteristics of a given user's voice or speech.
[0031] Keyword detection can be performed on one or more audio data inputs (e.g., also referred to herein as "audio data," "audio signal," and / or "audio sample"). For example, the audio sample provided to the keyword detection system can be a streaming audio signal. In some examples, keyword detection can be performed on the streaming audio signal in real time. The streaming audio signal can be recorded by or obtained from a microphone associated with a computing device. Keyword detection can be performed locally or remotely. For example, keyword detection can be performed locally using one or more processors of the same computing device that collects or obtains the streaming audio signal. In some examples, keyword detection can be performed remotely by sending the streaming audio signal (or its representation) from the local computing device to a remote computing device (e.g., the local computing device records the audio signal but offloads the keyword detection processing task to the remote computing device). Performing keyword detection locally can reduce total latency or computation time but may result in decreased accuracy. Performing keyword detection remotely can increase latency but may result in increased accuracy.
[0032] For example, local computing devices (e.g., smartphones) typically have lower computing power than remote computing devices (e.g., cloud computing systems), and therefore may generate keyword detection results with lower accuracy or overall performance, especially when subjected to time constraints associated with providing keyword detection results in real-time or near real-time. For instance, a local computing device might implement a keyword detection model with lower complexity than one implemented on a remote computing device to provide real-time keyword detection results. Lower accuracy in keyword detection results may include false positives (e.g., identifying keywords that do not actually exist), false negatives (e.g., failing to identify existing keywords), and misclassifications (e.g., identifying a first keyword as a different keyword).
[0033] However, performing keyword detection remotely can introduce communication latency that can offset the accuracy gains associated with remote keyword detection. For example, remote keyword detection can introduce latency along the communication path from the local computing device to the remote computing device (e.g., the time it takes to send a streaming audio signal or its representation to the remote computing device) and along the return communication path from the remote computing device to the local computing device (e.g., the time it takes to send the keyword detection results from the remote computing device back to the local computing device).
[0034] In some cases, multiple stages can be used to perform keyword detection. For example, multi-stage keyword detection can be used to minimize the power consumption associated with performing keyword detection on a power-sensitive device (e.g., minimizing the power consumption associated with always-on keyword detection performed by a battery-powered device such as a smartphone or other mobile computing device). In multi-stage keyword detection, one or more stages can implement a low-complexity and low-latency keyword detection model, and one or more subsequent stages can implement a more complex keyword detection model. For example, multi-stage keyword detection can be performed as a two-stage keyword detection. In such an example, the first-stage keyword detection model can be a low-complexity and low-latency keyword detection model. Based on the keyword detection output generated by the first stage (e.g., a keyword detection output with a confidence level greater than or equal to a first threshold), the second-stage keyword detection model can be activated and used to process the same audio samples (e.g., the same audio samples that triggered the detection output of the first stage).
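The two-stage gating described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `two_stage_detect`, the representation of each model as a callable returning a score in [0, 1], and both threshold values are hypothetical.

```python
from typing import Callable, Sequence

# Assumption: each "model" is a callable that maps audio samples to a
# detection score in [0, 1]. Real models would be neural networks.
Model = Callable[[Sequence[float]], float]


def two_stage_detect(
    audio: Sequence[float],
    first_stage: Model,
    second_stage: Model,
    first_threshold: float = 0.5,   # hypothetical value
    second_threshold: float = 0.8,  # hypothetical value
) -> bool:
    """Return True only if both stages accept the same audio samples."""
    # The low-complexity first stage always runs.
    if first_stage(audio) < first_threshold:
        # No first-stage detection: the costly second stage is never activated.
        return False
    # The second stage re-processes the same audio to confirm the detection
    # or to reject it as a false positive.
    return second_stage(audio) >= second_threshold
```

The power saving in this sketch comes from the early return: the more complex model is only invoked for audio that the first stage has already flagged.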
[0035] A second-level keyword detection model can be provided as a relatively high-complexity keyword detection model (e.g., a first-level keyword detection model can be provided as a relatively low-complexity keyword detection model). The performance of a second-level keyword detection model can be higher than that of a first-level keyword detection model. A second-level keyword detection model can be used to provide double verification of keyword detection (e.g., by validating or confirming the first-level keyword detection) or to reject first-level keyword detection as false positives (e.g., to invalidate the first-level keyword detection). However, performing multi-level keyword detection and / or using multiple different keyword detection models can also increase the end-to-end system latency of the keyword detection system.
[0036] In some cases, keyword detection can be performed in real-time (or near real-time) to allow users to interact with one or more computing devices. The lag between when a user utters a keyword (e.g., an activation phrase or specific command) and when the computing device provides a corresponding response or action can be a significant factor in a user's willingness to utilize spoken commands (e.g., spoken keywords). In some cases, a lag of several seconds can frustrate users or otherwise prevent them from using spoken keywords. Therefore, improved keyword detection performance (e.g., with reduced latency) is needed in local and / or remote keyword detection implementations, as both local and remote keyword detection implementations are typically time-limited processes.
[0037] This document describes systems, apparatus, methods (also referred to as processes), and computer-readable media (collectively, “systems and techniques”) for keyword detection systems that can be used to perform multi-level keyword detection with improved detection performance and reduced latency. For example, these systems and techniques can perform multi-level keyword detection using at least a first keyword detection level and a second keyword detection level. The first keyword detection level can be configured to perform initial keyword detection and keyword starting index estimation. The second keyword detection level can be configured to achieve better and / or more accurate keyword detection on audio samples corresponding to keywords from the initial detection of the first level. In some cases, these systems and techniques can be used to implement multi-level keyword detection systems for always-on keyword detection and / or other speech recognition tasks.
[0038] For example, the first keyword detection level can use keyword start index estimation to determine the time index corresponding to the estimated start of the detected keyword (e.g., within the processed audio sample). Keyword start index estimation can be performed after the initial keyword detection. In some aspects, based at least in part on the specific keyword detection model used to perform the keyword detection, the detection time associated with the keyword detection (e.g., detection timestamp, detection point, etc.) can be used as the estimated end index of the keyword. For example, the keyword detection model can be implemented based on the keyword detection occurring at the end of the keyword (e.g., or near the end of the keyword in the input audio stream, etc.). The estimated start and / or estimated end of the keyword can be used to provide a keyword buffer to the second keyword processing level, wherein the keyword buffer includes the corresponding portion of the audio sample starting from the estimated start of the detected keyword.
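As a rough illustration of how such a keyword buffer could be cut from the audio, the sketch below slices the samples from the estimated start index up to the detection point, which is used as the estimated end index. The function name `extract_keyword_buffer`, the use of sample positions as indices, and the inclusive end are assumptions made for the example.

```python
from typing import List, Sequence


def extract_keyword_buffer(
    samples: Sequence[float],
    estimated_start: int,
    detection_index: int,
) -> List[float]:
    """Return the samples from the estimated keyword start through the
    detection point (treated here as the estimated keyword end, inclusive)."""
    if not 0 <= estimated_start <= detection_index < len(samples):
        raise ValueError("indices must satisfy 0 <= start <= end < len(samples)")
    return list(samples[estimated_start : detection_index + 1])
```

For example, with ten samples, an estimated start of 3 and a detection point of 6, the buffer passed to the second level would hold the four samples at positions 3 through 6.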
[0039] In an exemplary example, these systems and techniques may include a speech rate classification engine that can be used to process input audio samples in parallel with keyword detection and initial estimation at a first keyword processing level. For example, the speech rate classification engine may be provided as a speech rate classification machine learning network (e.g., a neural network, etc.) configured to determine the speech rate of the input audio sample. In some aspects, the speech rate classification engine may classify the speech rate of the input audio sample as corresponding to a fast speaker (e.g., fast speech rate), a normal speaker (e.g., normal speech rate), a slow speaker (e.g., slow speech rate), etc.
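One way to picture the classifier output is as a set of per-class scores that are mapped to a fast, normal, or slow label. The sketch below assumes the network emits three raw scores (logits) in that class order and applies a softmax; the source does not specify the output layer, so `RATE_CLASSES`, `classify_speech_rate`, and the softmax step are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

# Assumed class order of the network's output scores.
RATE_CLASSES = ("fast", "normal", "slow")


def classify_speech_rate(logits: Sequence[float]) -> Tuple[str, float]:
    """Map raw class scores to (label, probability) using a softmax."""
    if len(logits) != len(RATE_CLASSES):
        raise ValueError("expected one score per speech rate class")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return RATE_CLASSES[best], probs[best]
```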
[0040] In some examples, the first keyword processing level may include a first machine learning network (e.g., a first neural network) configured to perform keyword detection based on input audio samples, and may include a second machine learning network (e.g., a second neural network) configured to perform keyword start index estimation on the detected keywords. In some aspects, based on keywords detected within the input audio samples (e.g., determined using the keyword detection machine learning network of the first level), these systems and techniques may perform keyword start index estimation (e.g., using the keyword start index estimation machine learning network of the first level) and speech rate classification (e.g., using the speech rate classification machine learning network) in parallel.
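The parallel arrangement can be sketched with two worker threads, one per network. This is only a host-side illustration: `run_parallel` and the two callables are hypothetical stand-ins, and on a real device the two networks might instead be dispatched to separate hardware blocks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple


def run_parallel(
    audio: Sequence[float],
    estimate_start_index: Callable[[Sequence[float]], int],
    classify_speech_rate: Callable[[Sequence[float]], str],
) -> Tuple[int, str]:
    """Run start index estimation and speech rate classification
    concurrently on the same audio after a keyword has been detected."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        start_future = pool.submit(estimate_start_index, audio)
        rate_future = pool.submit(classify_speech_rate, audio)
        # Both results are needed before index refinement can proceed.
        return start_future.result(), rate_future.result()
```

Because neither task depends on the other's output, the time to obtain both results in this sketch is bounded by the slower of the two rather than by their sum.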
[0041] The keyword index refinement engine can be configured to determine refined (e.g., fine-tuned, updated, modified, etc.) start and / or end indices for the detected keywords at the first level. In an illustrative example, refined keyword start and / or end indices that are closer to the actual or ground truth indices of the keywords spoken in the input audio sample can improve the performance and / or reduce the latency of a multi-level keyword detection system. The keyword index refinement engine can determine the refined keyword start and / or end indices based on estimated keyword start and end indices determined by a keyword start index estimation network and on speech rate information determined for the input audio sample using a speech rate classification network.
[0042] For example, one or more offline datasets can be used to generate average spoken keyword lengths for each corresponding keyword among one or more different keywords configured for identification by a multi-level keyword detection system. For each corresponding keyword, the offline datasets can be used to generate corresponding average spoken keyword length information for each corresponding speech rate category in a set of one or more possible speech rate categories, which can be output by a speech rate classification network. For example, where the speech rate classification network can output speech rate information indicating fast, normal, or slow speech rates, the offline datasets can be used to generate the average spoken keyword length for each corresponding keyword among one or more different keywords, uttered by a slow-speed speaker, uttered by a normal-speed speaker, and uttered by a fast-speed speaker.
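The offline step can be illustrated as a simple grouped average. The sketch below assumes the offline dataset is available as `(keyword, speech_rate_class, spoken_length)` records with lengths measured in samples; the record layout, the function name `build_average_length_table`, and the example keyword are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, int]  # (keyword, speech rate class, length in samples)


def build_average_length_table(
    records: Iterable[Record],
) -> Dict[Tuple[str, str], float]:
    """Average the spoken length for every (keyword, speech rate) pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of lengths, count]
    for keyword, rate_class, length in records:
        entry = totals[(keyword, rate_class)]
        entry[0] += length
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up records for a hypothetical keyword, for illustration only.
example_table = build_average_length_table(
    [
        ("hey device", "fast", 8000),
        ("hey device", "fast", 10000),
        ("hey device", "slow", 20000),
    ]
)
```

The resulting table is keyed by keyword and speech rate class, so a lookup at detection time needs only the detected keyword and the classifier's output.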
[0043] In some examples, the average spoken length information for each corresponding keyword spoken by speakers at different speaking speeds can be determined offline and embedded in machine learning model metadata (e.g., neural network metadata) used to configure one or more machine learning models in a multi-level keyword detection system. Based on embedding the average spoken keyword length information in the machine learning model metadata, these systems and techniques can generate refined keyword start and end indices by fine-tuning or refining the initial keyword index estimate to correspond to the detected speaking speed classification of the input audio sample.
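A possible refinement rule is sketched below. The source states only that the refined start index is offset earlier and the refined end index later than the estimates, based on the average spoken length; how the offset is divided is not specified. Splitting the shortfall evenly between the two ends, clamping to the buffer bounds, and the name `refine_keyword_index` are all assumptions of this example.

```python
from typing import Tuple


def refine_keyword_index(
    est_start: int,
    est_end: int,
    avg_length: float,
    num_samples: int,
) -> Tuple[int, int]:
    """Widen the estimated [start, end] window toward the average spoken
    length for the detected keyword and speech rate class."""
    est_length = est_end - est_start
    # Widen only when the estimate is shorter than the average length.
    shortfall = max(0, int(round(avg_length)) - est_length)
    pad_before = shortfall // 2           # assumption: even split
    pad_after = shortfall - pad_before
    refined_start = max(0, est_start - pad_before)           # earlier start
    refined_end = min(num_samples - 1, est_end + pad_after)  # later end
    return refined_start, refined_end
```

For instance, an estimated window of 4000 to 10000 with an average length of 9000 samples is 3000 samples short, so this sketch moves the start back to 2500 and the end forward to 11500.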
[0044] Other aspects of the systems and techniques will be described with reference to the accompanying drawings.
[0045] Figure 1 illustrates an example implementation of a System-on-Chip (SoC) 100, which may include a Central Processing Unit (CPU) 102 or a multi-core CPU configured to perform one or more of the functions described herein. Parameters or variables (e.g., neural signals and synaptic weights), system parameters associated with computing devices (e.g., a weighted neural network), latency, frequency bin information, task information, and other information may be stored in a memory block associated with a Neural Processing Unit (NPU) 108, a memory block associated with the CPU 102, a memory block associated with a Graphics Processing Unit (GPU) 104, a memory block associated with a Digital Signal Processor (DSP) 106, a memory block 118, and / or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from the program memory associated with the CPU 102 or may be loaded from memory block 118.
[0046] SoC 100 may also include additional processing blocks tailored for specific functions, such as GPU 104, DSP 106, connectivity block 110 (which may include fifth-generation (5G) connectivity, fourth-generation LTE (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, etc.), and multimedia processor 112 capable of detecting and recognizing, for example, gestures, speech, and / or other interactive user actions or inputs. In one specific implementation, NPU 108 is implemented within CPU 102, DSP 106, and / or GPU 104. SoC 100 may also include sensor processor 114, image signal processor (ISP) 116, and / or keyword detection system 120. In some examples, sensor processor 114 may be associated with or connected to one or more sensors for providing sensor input to sensor processor 114. For example, the one or more sensors and sensor processor 114 may be provided in the same computing device, coupled to the same computing device, or otherwise associated with the same computing device.
[0047] In some examples, one or more sensors may include one or more microphones for receiving sound (e.g., audio input), including sound or audio input that can be used to perform keyword spotting (KWS), which can be considered a specific type of keyword detection. In some cases, the sound or audio input received by one or more microphones (and / or other sensors) may be digitized into data packets for analysis and / or transmission. Audio input may include ambient sound in the vicinity of the computing device associated with the SoC 100 and / or may include speech from a user of the computing device associated with the SoC 100. In some cases, the computing device associated with the SoC 100 may be additionally or alternatively communicatively coupled to one or more peripheral devices (not shown) and / or configured to communicate with one or more remote computing devices or external resources, for example, using a wireless transceiver and a communication network such as a cellular communication network.
[0048] SoC 100, DSP 106, NPU 108, and / or keyword detection system 120 may be configured to perform audio signal processing. For example, keyword detection system 120 may be configured to perform various steps of KWS. As another example, one or more parts of the steps for voice KWS (such as feature generation) may be performed by keyword detection system 120, while DSP 106 / NPU 108 performs other steps, such as using one or more machine learning networks and / or machine learning techniques according to various aspects of this disclosure and as described herein.
[0049] Figure 2 illustrates an example keyword detection first stage 200 according to various aspects of this disclosure. The keyword detection first stage 200 receives an audio signal (e.g., pulse code modulation (PCM) audio data from an analog microphone, pulse density modulation (PDM) high-definition audio from a digital microphone, etc.) from an audio source 202 in an electronic system. For example, the audio signal may be generated by one or more microphones of an electronic device, such as a mobile electronic device, a smart home device, an Internet of Things (IoT) device, or other edge processing device. In some cases, the audio signal may be received substantially in real time.
[0050] In some cases, certain devices (such as relatively low-power (e.g., battery-powered) devices) may include a two-stage speech recognition system, where a first keyword detection stage (e.g., first keyword detection stage 200) generates keyword detection output, which can be used to activate a second keyword detection stage (e.g., second keyword detection stage 214). In multi-stage keyword detection, one or more stages can implement a low-complexity and low-latency keyword detection model, and one or more subsequent stages can implement a more complex keyword detection model.
[0051] For example, the model associated with the first keyword detection level 200 can be a low-complexity and low-latency keyword detection model. Based on the keyword detection output generated by the first level 200 (e.g., a keyword detection output with a detection score greater than or equal to a first threshold), the model associated with the second keyword detection level 214 can be activated and used to process the same audio samples (e.g., the same audio samples that triggered the detection output of the first level 200). The relatively high-complexity and / or higher-performance second-level keyword detection model can be used to provide double confirmation of keyword detection (e.g., by verifying or confirming the keyword detection of the first level) or to reject the first-level keyword detection as a false positive (e.g., to invalidate the keyword detection of the first level).
[0052] In some cases, the first stage 200 of keyword detection can be implemented using relatively low-power circuitry (such as DSPs, codec circuits, etc.). When a keyword is detected, the second stage 214 can be activated, which can handle more complex tasks, such as more free-form word recognition, command detection, task execution, etc. In some cases, the second stage can be executed on relatively high-power circuitry, such as processors, GPUs, ML / AI processors, etc.
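The gating behavior of the two stages described above can be sketched as follows. This is a minimal illustration only: the function name, the threshold values, and the toy scoring functions are assumptions and do not come from this disclosure.

```python
from typing import Callable, Sequence

Scorer = Callable[[Sequence[float]], float]


def two_stage_detect(
    samples: Sequence[float],
    stage1_score: Scorer,
    stage2_score: Scorer,
    stage1_threshold: float = 0.5,  # assumed value
    stage2_threshold: float = 0.8,  # assumed value
) -> str:
    """Run the low-cost first stage; activate the second stage only on a hit."""
    if stage1_score(samples) < stage1_threshold:
        return "no_detection"  # the second stage is never activated
    # The second stage processes the same audio samples that triggered stage 1.
    if stage2_score(samples) >= stage2_threshold:
        return "confirmed"
    return "rejected_false_positive"


# Toy stand-ins for the two models (hypothetical, for illustration only).
def cheap(samples: Sequence[float]) -> float:
    return max(samples)


def costly(samples: Sequence[float]) -> float:
    return sum(samples) / len(samples)


result_quiet = two_stage_detect([0.1, 0.2], cheap, costly)
result_hit = two_stage_detect([0.9, 0.9, 0.8], cheap, costly)
result_false_positive = two_stage_detect([0.9, 0.1, 0.1], cheap, costly)
```

In this sketch the power saving comes from the early return: the more complex scorer is only evaluated when the first stage reports a detection.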
[0053] As illustrated in Figure 2, received audio samples (e.g., audio samples from audio source 202) are processed by feature generator 204 of keyword detection first stage 200. Feature generator 204 can be, for example, a hardware-implemented Fourier transform, such as a Fast Fourier Transform (FFT) function or circuit. A Fourier transform is typically a function used to decompose a time-domain representation of a signal (such as a received audio signal) into a frequency-domain representation. The frequency-domain representation may include voltages or power present at different frequencies in the received audio signal. In some cases, feature generator 204 may generate a set of features, such as feature vectors, based on these representations. This set of features can be output to keyword detector 208. An example of feature generator 204 is described in more detail with respect to Figure 3. It is worth noting that other or additional forms of feature generation can be used in other aspects and examples.
[0054] Keyword detector 208 can use keyword detection model 212 to determine whether the received audio signal contains a portion of a keyword. In some cases, keyword detector 208 can accept tens to hundreds of audio frames per second as input, and keyword detector 208 can attempt to detect portions of keywords in the audio signal. In some cases, keyword detection model 212 of keyword detector 208 can be part of a multi-level speech recognition system.
[0055] After keyword detector 208 determines that a keyword has been detected in the received audio signal, keyword detector 208 generates a signal for second level 214. For example, the detected keyword may cause an application to launch, or wake up another part of the electronic device (e.g., screen, other processor, or other sensors), run a query locally or at a remote data service, perform additional speech recognition processing, etc. In some aspects, second level 214 may receive an indication that a keyword has been detected, while in other aspects and / or examples, second level 214 may receive additional information, such as information specific to the detected keyword, such as one or more detected keywords in voice activity. It is worth noting that additional functions (not shown) may exist between keyword detector 208 and second level 214, such as an additional level for keyword activity detection or analysis.
[0056] Figure 3 illustrates a feature generator 300 according to various aspects of this disclosure. The feature generator 300 is an example of feature generator 204 of Figure 2. It should be understood that many techniques can be used to generate feature vectors for audio, and feature generator 300 is merely a single example of techniques that can be used to generate feature vectors.
[0057] Feature generator 300 receives audio signals at signal preprocessor 302. As described above, the audio signals may come from an audio source of an electronic device, such as a microphone, such as audio source 202.
[0058] The signal preprocessor 302 can perform various preprocessing steps on the received audio signal. For example, the signal preprocessor 302 can split the audio signal into parallel audio signals and delay one of the signals by a predetermined amount of time in preparation for inputting the audio signal into the FFT circuit.
[0059] As another example, the signal preprocessor 302 may apply window functions, such as Hamming, Hann, Blackman-Harris, Kaiser-Bessel window functions, or other sinusoidal window functions, which can improve the performance of further processing stages, such as the signal domain transformer 304. Generally, windowing (or window) functions can be used to reduce the magnitude of discontinuities at the boundaries of each finite sequence of the received audio signal data to improve further processing.
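As an illustration of the windowing step, the following minimal sketch applies a symmetric Hann window to one frame. The 16-sample frame length and the function names are assumptions chosen for the example; the disclosure does not specify them.

```python
import math
from typing import List


def hann_window(n: int) -> List[float]:
    """Symmetric Hann window: w[k] = 0.5 - 0.5 * cos(2*pi*k / (n - 1))."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def apply_window(frame: List[float]) -> List[float]:
    """Taper the frame so its edges fall to zero before the transform."""
    window = hann_window(len(frame))
    return [x * w for x, w in zip(frame, window)]


frame = [1.0] * 16  # a constant frame makes the taper easy to see
windowed = apply_window(frame)
```

The first and last samples of `windowed` are driven to zero, which is how the window reduces the boundary discontinuity of each finite frame.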
[0060] As another example, signal preprocessor 302 can convert audio signal data from parallel to serial, or vice versa, for further processing. The preprocessed audio signal generated by signal preprocessor 302 can be provided to signal domain transformer 304, which can transform the preprocessed audio signal from a first domain to a second domain, such as from the time domain to the frequency domain.
[0061] In some aspects, the signal domain transformer 304 implements a Fourier transform, such as the Fast Fourier Transform (FFT). For example, in some cases, the Fast Fourier Transform can be a 16-band (or frequency bin, channel, or point) FFT, which generates a compact feature set that can be efficiently processed by the model. In some cases, the Fourier transform provides finer spectral domain information about the incoming audio signal than conventional single-channel processing, such as conventional hardware SNR threshold detection. The result of the signal domain transformer 304 is a set of audio features, such as a set of voltages, power, or energy for each frequency band in the transformed data.
[0062] The set of audio features can then be provided to a signal feature filter 306, which can reduce the size of the feature set in the audio feature data or compress the feature set. In some aspects, the signal feature filter 306 can discard certain features from the audio feature set, such as symmetrical or redundant features from multiple frequency bands of a multi-band FFT. Discarding this data reduces the overall size of the data stream for further processing and can be referred to as compressing the data stream.
[0063] For example, in some cases, since the audio signal is real-valued, a 16-band FFT may include eight symmetrical or redundant frequency bands after the squared magnitude (power) of each band is computed. Therefore, the signal feature filter 306 can filter out redundant or symmetrical frequency band information and output an audio feature vector 308. In some cases, the output of the signal feature filter may be compressed or otherwise processed before being output as the audio feature vector 308.
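The symmetry that makes half of the bands redundant can be checked with a short sketch. It uses a naive 16-point discrete Fourier transform in place of a hardware FFT (the results are the same, only slower). Keeping bins 0 to 7 is an assumption made here to match the count of eight; the function names are also illustrative.

```python
import cmath
import math
from typing import List

N_POINTS = 16  # matches the 16-band example in the text


def dft(frame: List[float]) -> List[complex]:
    """Naive O(n^2) DFT, used here as a stand-in for an FFT circuit."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(frame: List[float]) -> List[float]:
    """Squared magnitude of every frequency bin."""
    return [abs(c) ** 2 for c in dft(frame)]


def compact_features(frame: List[float]) -> List[float]:
    """Keep the lower half of the bins; the upper half mirrors it for real input."""
    return power_spectrum(frame)[: len(frame) // 2]


# A real-valued test tone centered on bin 3.
frame = [math.sin(2.0 * math.pi * 3 * t / N_POINTS) for t in range(N_POINTS)]
power = power_spectrum(frame)
features = compact_features(frame)
```

For real-valued input, `power[k]` equals `power[N_POINTS - k]`, so the discarded upper bins carry no information that the retained bins lack, apart from the single Nyquist bin that this sketch also drops.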
[0064] The audio feature vector 308 can be provided to a keyword detector for processing by a keyword detection model, such as the keyword detector 208 and keyword detection model 212 shown in Figure 2.
[0065] In some cases, voice detection models (such as keyword detection model 212) can be executed on SoC 100 and / or its components (such as DSP 106 and / or NPU 108 of Figure 1). In some cases, the voice detection model can be a machine learning model or system.
[0066] Machine learning (ML) can be considered a subset of artificial intelligence (AI). ML systems can include algorithms and statistical models that computer systems can use to perform various tasks through pattern-dependent inference without explicit instructions. An example of an ML system is a neural network (also known as an artificial neural network), which can include groups of interconnected artificial neurons (e.g., neuron models). Neural networks can be used in a variety of applications and / or devices, such as speech analysis, audio signal analysis, image and / or video decoding, image analysis and / or computer vision applications, Internet Protocol (IP) cameras, Internet of Things (IoT) devices, autonomous vehicles, service robots, and more.
[0067] Individual nodes in a neural network mimic biological neurons by taking input data and performing simple operations on that data. The results of these simple operations on the input data are selectively passed to other neurons. Weights are associated with each vector and node in the network, and these values constrain how the input data relates to the output data. For example, the input data of each node can be multiplied by its corresponding weight value, and the products can be summed. The sum of the products can be adjusted with optional biases, and activation functions can be applied to the results to produce the node's output signal or "output activation" (sometimes called a feature map or activation map). The weights can initially be determined by an iterative stream of training data through the network (e.g., weights are established during training phases where the network learns how to identify a particular category based on the characteristics of its typical input data).
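The per-node computation described above (weighted sum, optional bias, activation) can be written in a few lines. This is a minimal sketch; the sigmoid activation and the function name are assumptions, since the text does not name a specific activation function.

```python
import math
from typing import Sequence


def neuron_output(
    inputs: Sequence[float], weights: Sequence[float], bias: float = 0.0
) -> float:
    """Multiply each input by its weight, sum, add the bias, then activate."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid (assumed choice)


activation = neuron_output([1.0, 2.0], [0.5, -0.25])
```

Here the weighted sum is 1.0 * 0.5 + 2.0 * (-0.25) = 0, so the sigmoid returns 0.5.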
[0068] There are different types of neural networks, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), Multilayer Perceptron (MLP) neural networks, Transformer Neural Networks, and so on. For example, a Convolutional Neural Network (CNN) is a feedforward artificial neural network. A CNN may consist of a collection of artificial neurons, each with its own receptive field (e.g., a localized region of the input space) that collectively tile the input space. RNNs work on the principle of storing the layer's output and feeding that output back to the input to help predict the layer's outcome. A GAN is a generative neural network that learns patterns in the input data so that the neural network model can generate new synthetic outputs that could reasonably have come from the original dataset. A GAN may consist of two neural networks operating together: a generative neural network that generates the synthetic output and a discriminative neural network that evaluates the authenticity of the output. In an MLP neural network, data is fed into the input layer, and one or more hidden layers provide levels of abstraction for the data. The output layer can then make predictions based on the abstracted data.
[0069] Deep learning (DL) is an example of machine learning techniques and can be considered a subset of ML. Many DL methods are based on neural networks, such as RNNs or CNNs, and utilize multiple layers. Using multiple layers in a deep neural network allows for the progressive extraction of higher-level features from a given raw data input. For example, the output of the first layer of artificial neurons becomes the input of the second layer, the output of the second layer becomes the input of the third layer, and so on. The layers located between the input and output of the entire deep neural network are often called hidden layers. Hidden layers learn (e.g., are trained) by transforming intermediate inputs from previous layers into slightly more abstract and complex representations that can be provided to subsequent layers until the final or desired representation is obtained as the final output of the deep neural network.
[0070] As noted above, neural networks are examples of machine learning systems and can include an input layer, one or more hidden layers, and an output layer. Data is provided from input nodes in the input layer, processed by hidden nodes in one or more hidden layers, and output is produced by output nodes in the output layer. Deep learning networks typically include multiple hidden layers. Each layer of a neural network can include a feature map or activation map, which can include artificial neurons (or nodes). Feature maps can include filters, kernels, etc. Nodes can include one or more weights used to indicate the importance of nodes in one or more layers. In some cases, deep learning networks may have a series of many hidden layers, where earlier layers are used to determine simple and low-level properties of the input, and later layers build a hierarchy of more complex and abstract properties.
[0071] Deep learning architectures can learn hierarchical structures of features. For example, if presented with visual data, the first layer can learn to recognize relatively simple features in the input stream, such as edges. In another example, if presented with auditory data, the first layer can learn to recognize spectral power at specific frequencies. The second layer, taking the output of the first layer as input, can learn to recognize combinations of features, such as simple shapes in visual data or combinations of sounds in auditory data. For example, higher layers can learn to represent complex shapes in visual data or words in auditory data. Even higher layers can learn to recognize common visual objects or spoken phrases. Deep learning architectures perform particularly well when applied to problems with natural hierarchical structures. For example, the classification of motorized vehicles can benefit from first learning to recognize features such as wheels, windshields, and others. These features can then be combined in different ways at higher layers to identify cars, trucks, and airplanes.
[0072] Figures 4A to 4C illustrate example neural networks for keyword detection according to various aspects of this disclosure. Neural networks can be designed to have multiple connectivity patterns. In feedforward networks, information is passed from lower layers to higher layers, where each neuron in a given layer communicates with neurons in higher layers. As described above, hierarchical representations can be constructed in successive layers of a feedforward network. Neural networks can also have recurrent or feedback (also known as top-down) connections. In recurrent connections, the output from a neuron in a given layer can be communicated to another neuron in the same layer. Recurrent architectures can help identify patterns across more than one block of input data that is sequentially delivered to the neural network. Connections from neurons in a given layer to neurons in lower layers are called feedback (or top-down) connections. Networks with many feedback connections can be helpful when the recognition of higher-level concepts can aid in discerning specific lower-level features of the input.
[0073] In some cases, the connections between layers of a neural network can be fully connected or locally connected. Figure 4A illustrates an example of a fully connected neural network 402. In the fully connected neural network 402, neurons in the first layer can transmit their outputs to each neuron in the second layer, such that each neuron in the second layer receives input from each neuron in the first layer. Figure 4B illustrates an example of a locally connected neural network 404. In the locally connected neural network 404, neurons in a first layer can connect to a limited number of neurons in a second layer. More generally, the locally connected layers of the locally connected neural network 404 can be configured such that each neuron in a layer has the same or a similar connectivity pattern, but with connection strengths that can have different values (e.g., 410, 412, 414, and 416). The connectivity pattern of locally connected layers can produce spatially distinct receptive fields in higher layers because neurons in higher layers in a given region can receive inputs that are tuned through training to the properties of a restricted portion of the network's total input.
[0074] An example of a locally connected neural network is a convolutional neural network. Figure 4C illustrates an example of a convolutional neural network 406. Convolutional neural network 406 can be configured such that the connection strength associated with the input of each neuron in the second layer is shared (e.g., 408). Convolutional neural networks may be well-suited for problems where the spatial location of the input is meaningful.
[0075] Figure 5 is a block diagram illustrating an example of a deep convolutional network (DCN) 550 according to various aspects of this disclosure. The DCN 550 may include multiple layers of different types based on connectivity and weight sharing. As shown in Figure 5, DCN 550 includes convolutional blocks 554A and 554B. Each of convolutional blocks 554A and 554B can be configured with a convolutional layer (CONV) 556, a normalization layer (LNorm) 558, and a max pooling layer (MAX POOL) 560.
[0076] Convolutional layer 556 may include one or more convolutional filters that can be applied to input data 552 to generate feature maps. Although only two convolutional blocks 554A and 554B are shown, this disclosure is not limited thereto, and any number of convolutional blocks (e.g., blocks 554A and 554B) may be included in DCN 550 according to design preferences. Normalization layer 558 may normalize the output of the convolutional filters. For example, normalization layer 558 may provide whitening or lateral inhibition. Max pooling layer 560 may provide downsampling aggregation over space to achieve local invariance and dimensionality reduction.
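The order of operations inside one convolutional block (convolution, normalization, max pooling) can be sketched in one dimension. This is an illustration under stated assumptions: the normalization used here is a plain zero-mean, unit-variance scaling rather than the whitening or lateral inhibition named in the text, and the kernel, pool size, and function names are hypothetical.

```python
import math
from typing import List


def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Valid-mode sliding dot product (cross-correlation, as in most ML code)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def normalize(values: List[float], eps: float = 1e-8) -> List[float]:
    """Zero-mean, unit-variance scaling (assumed stand-in for layer 558)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def max_pool1d(values: List[float], size: int = 2) -> List[float]:
    """Keep the maximum of each non-overlapping window of `size` values."""
    return [max(values[i : i + size]) for i in range(0, len(values) - size + 1, size)]


def conv_block(signal: List[float], kernel: List[float]) -> List[float]:
    """CONV -> normalization -> MAX POOL, in the order described for block 554A."""
    return max_pool1d(normalize(conv1d(signal, kernel)))


block_output = conv_block(
    [0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 1.0, 0.0], [1.0, 0.0, -1.0]
)
```

With a nine-sample input and a three-tap kernel, the convolution yields seven values and pooling by two reduces them to three, which shows the dimensionality reduction the pooling layer provides.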
[0077] For example, the parallel filter bank of a deep convolutional network can be loaded onto the CPU 102 or GPU 104 of the SoC 100 to achieve high performance and low power consumption. In some examples, the parallel filter bank can be loaded onto the DSP 106 or ISP 116 of the SoC 100. Additionally, the DCN 550 can access other processing blocks that may exist on the SoC 100, such as the sensor processor 114 dedicated to sensors and navigation, and the keyword detection system 120.
[0078] The deep convolutional network 550 may also include one or more fully connected layers, such as layer 562A (labeled "FC1") and layer 562B (labeled "FC2"). The DCN 550 may also include a logistic regression (LR) layer 564. Between each layer 556, 558, 560, 562A, 562B, 564 of the DCN 550 are weights (not shown) to be updated. The output of each of these layers (e.g., 556, 558, 560, 562A, 562B, 564) can serve as the input to the next layer in these layers (e.g., 556, 558, 560, 562A, 562B, 564) of the deep convolutional network 550 to learn hierarchical feature representations from the input data 552 (e.g., images, audio, video, sensor data, and / or other input data) provided at the initial convolutional block 554A.
[0079] To adjust the weights, the learning algorithm computes the gradient vector of the weights. The gradient indicates by how much the error will increase or decrease as the weights are adjusted. At the top layers, the gradient corresponds directly to the values of the weights connecting the activated neurons in the penultimate layer to the neurons in the output layer. In lower layers, the gradient depends on the values of the weights and the error gradient computed in the higher layers. The weights can then be adjusted to reduce the error. This method of adjusting weights is called "backpropagation" because it involves "passing backward" through the neural network.
[0080] In practice, the error gradient of the weights can be computed over a small number of examples so that the computed gradient approximates the true error gradient. This approximation method is called stochastic gradient descent. Stochastic gradient descent can be repeated until the achievable error rate of the entire system stops decreasing or until the error rate reaches the target level. After learning, new inputs can be presented to the DCN, and the forward pass of the network can produce an output that can be considered an inference or prediction of the DCN.
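The idea of estimating the gradient from a small number of examples can be shown on a model far simpler than a DCN. The sketch below fits a two-parameter linear model with mini-batch stochastic gradient descent; the model, learning rate, batch size, epoch count, and data are all assumptions chosen for illustration.

```python
import random
from typing import List, Tuple


def sgd_fit(
    samples: List[Tuple[float, float]],
    lr: float = 0.05,
    epochs: int = 500,
    batch_size: int = 2,
    seed: int = 0,
) -> Tuple[float, float]:
    """Fit y = w * x + b by minimizing squared error over small random batches."""
    rng = random.Random(seed)
    data = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start : start + batch_size]
            # Gradient of the mean squared error, estimated from this batch only.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            # Step against the gradient to reduce the error.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free data generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]]
w, b = sgd_fit(samples)
```

Each update uses only two of the six samples, yet the repeated noisy steps bring `w` and `b` close to the generating values of 2 and 1.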
[0081] The output of DCN 550 is a classification score 566 of the input data 552. The classification score 566 can be a probability or a set of probabilities, where each probability is the probability that the input data includes a feature from the set of features that DCN 550 was trained to detect.
[0082] In some cases, ML systems or models can be used to analyze each audio frame to determine whether a voice command is likely to be present. For keyword detection, the output of the ML network (such as a probability) can be referred to as a frame score. This frame score indicates the likelihood that a frame contains one or more parts of a voice command (such as a keyword). As an example, in the case of keyword detection responding to the keyword "hey device," a first audio frame may have an audio signal that includes the sound corresponding to "he." The ML network should output a higher frame score for the first audio frame than for another audio frame whose audio signal does not include any sound corresponding to "hey device." While frame scoring is discussed herein in the context of ML systems, in some cases non-ML techniques can be used to analyze audio frames to generate frame scores and determine whether a voice command is likely to be present. Examples include Gaussian mixture models (GMMs), Hidden Markov models (HMMs), combined GMM-HMM systems, Dynamic Time Warping (DTW), and / or other processes using Gaussian acoustic models and / or N-gram language models, such as phoneme likelihood estimation, Viterbi decoding, etc. These non-ML techniques may also be skipped based on the techniques discussed herein.
[0083] As previously noted, this disclosure describes systems and techniques for providing keyword detection systems that can be used to perform multi-level keyword detection with improved detection performance and reduced latency. For example, these systems and techniques can be used to improve detection performance and / or reduce latency associated with multi-level keyword detection by determining refined keyword start and / or end indices based on speech rate classification information corresponding to input audio samples initially detected as keywords. By refining the keyword start and / or end indices to be closer to the respective actual (e.g., ground truth) keyword start and / or end indices, detection performance and / or latency can be improved by reducing the portion of audio samples buffered (e.g., using a keyword buffer) for processing by the second-level keyword detection model. For example, where the first keyword detection level uses incorrect or inaccurate estimates of the keyword start or end indices, refinement of the keyword start and / or end indices can be used to reduce keyword truncation from the first keyword detection level. By determining and using refined keyword start and / or end indices after the first keyword detection level, a more accurate keyword buffer can be provided to the second keyword detection level downstream of (e.g., after) the first keyword detection level. In an example where a truncated keyword buffer is passed to the second keyword detection level (e.g., without keyword start and / or end index refinement), the second keyword detection level may generate a rejection because the truncated keyword buffer contains only a portion of the keyword utterance.
[0084] Figure 6 is a diagram illustrating an example of keyword audio samples 600 and the estimated start and end indices of the keyword detected in the audio samples, as well as the ground truth start and end indices of the keyword in the audio samples.
[0085] For example, the estimated keyword start index 632 corresponds to the keyword start time estimated by the keyword start estimation machine learning network and / or by the first level of the multi-level keyword detection system. For example, the estimated keyword start index 632 can be determined using a first level that is the same as or similar to the keyword detection first stage 200 of Figure 2. In some examples, the estimated keyword start index 632 can be determined using a keyword start estimation network that is the same as or similar to the keyword start estimation network 816 included in the keyword detection and start estimation system (e.g., keyword processing level 1) 820 of Figure 8.
[0086] The estimated keyword termination index 634 may correspond to the estimated keyword termination time within the audio sample 600. In some examples, the estimated keyword termination index 634 may be the same as the timestamp of performing the initial first-level keyword detection. For example, the estimated keyword termination index 634 may be a timestamp corresponding to the initial keyword detection output generated by the first keyword processing level, indicating successful initial detection of keywords within the audio sample 600. In some examples, multi-level keyword processing and detection may be performed without explicit processing to determine or identify the keyword termination time (e.g., the estimated keyword termination index 634).
[0087] The estimated keyword start index 632 may be later than the actual (e.g., ground truth) keyword start index 602, which corresponds to the time index at which the keyword first begins to be spoken or represented in the audio sample 600. For example, the difference between the estimated keyword (KW) start index 632 and the actual KW start index 602 may represent an error or difference in the keyword start estimation performed by the first keyword processing level.
[0088] The estimated KW end index 634 may be earlier than the actual (e.g., ground truth) KW end index 604, which corresponds to the time index at which the keyword is no longer spoken or represented in the audio sample 600. The estimated KW end index 634 may be earlier than the actual KW end index 604 because the estimated KW end index 634 is the same as the timestamp at which keyword detection reaches the detection confidence threshold level that allows the keyword detection process to exit. For example, where keyword detection can be performed using only a portion of the complete keyword spoken in the input audio sample, the first keyword detection level may exit earlier (e.g., relative to the actual KW end index 604), and the estimated KW end index 634 will be earlier. The difference between the estimated KW end index 634 and the actual KW end index 604 may represent an error or difference in the estimation of the keyword end time of the audio sample 600.
[0089] As previously noted, multi-level keyword detection can be performed using at least a first keyword detection level (e.g., keyword detection first stage 200 of Figure 2) and a second keyword detection level (e.g., second stage 214 of Figure 2). The first keyword detection level can be configured to perform initial keyword detection and keyword start index estimation. The second keyword detection level can be configured to achieve better and / or more accurate keyword detection on audio samples corresponding to the keywords from the initial detection of the first level.
[0090] In some cases, the first keyword detection level may use the estimated KW start index 632 to configure the keyword buffer to include relevant portions of the audio sample 600 corresponding to keywords detected by the first keyword detection level. For example, the first keyword detection level may process the audio sample 600 to perform initial keyword detection of configured keywords. At time index 634 within the audio sample 600, the first keyword detection level determines that a keyword has been detected within the audio sample 600 and sets the keyword detection time to be equal to the estimated KW end index 634.
[0091] Subsequently (for example, after a keyword is detected), the first keyword detection level can perform keyword start estimation to determine the estimated KW start index 632 of the keyword detected within the audio sample 600.
[0092] Using the estimated KW start index 632 and the estimated KW end index 634, the first keyword detection level can configure the keyword buffer to include a portion of the audio sample 600 between the estimated KW start index 632 and the estimated KW end index 634, and pass the keyword buffer to the second keyword detection level for processing.
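The sequence in the preceding paragraphs (detect, set the end index to the detection time, estimate the start, then fill the buffer) can be sketched as follows. The backward walk used for start estimation is a hypothetical placeholder for the keyword start estimation network, not the method of this disclosure, and the thresholds, frame scores, and names are assumptions.

```python
from typing import List, Optional, Tuple


def first_level_bounds(
    frame_scores: List[float],
    detect_threshold: float = 0.8,  # assumed detection confidence threshold
    start_threshold: float = 0.3,   # assumed threshold for the placeholder
) -> Optional[Tuple[int, int]]:
    """Return (estimated_start, estimated_end) frame indices, or None."""
    for end_index, score in enumerate(frame_scores):
        if score >= detect_threshold:
            break  # the detection time becomes the estimated end index
    else:
        return None  # no keyword detected in this audio sample
    # Placeholder start estimation: walk back while the scores stay elevated.
    start_index = end_index
    while start_index > 0 and frame_scores[start_index - 1] >= start_threshold:
        start_index -= 1
    return start_index, end_index


frames = ["f0", "f1", "f2", "f3", "f4", "f5", "f6"]  # stand-ins for audio frames
frame_scores = [0.0, 0.1, 0.4, 0.6, 0.9, 0.7, 0.2]
bounds = first_level_bounds(frame_scores)
# The buffer holds the audio between the two estimated indices (inclusive).
keyword_buffer = frames[bounds[0] : bounds[1] + 1] if bounds else []
```

Note that frame `f5` still scores highly but falls outside the buffer, because the first level exits as soon as the detection threshold is reached. This is the early end index behavior described for index 634.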
[0093] In the prior art, the estimated KW start and / or end index often deviates from the actual, benchmark true KW start and / or end index. The keyword buffer between the first keyword detection level and the second keyword detection level can be filled with audio data of the audio sample 600 that is outside the estimated KW start index 632 (e.g., earlier than the estimated KW start time 632) and / or outside the estimated KW end index 634 (e.g., later than the estimated KW end time 634).
[0094] For example, configured and / or static padding values can be used to perform keyword buffer padding between keyword detection levels, which indicate audio data of audio sample 600 that is outside the estimated KW indices 632, 634 but should be included in the keyword buffer provided to the second keyword detection level for an additional time range or time window.
[0095] In some examples, static padding values configured for the start and / or end indices of the keyword buffer may be inefficient and increase the latency of the keyword detection system, such as when the padding value is greater than the actual error between the estimated start index 632 and the actual start index 602 and / or greater than the actual error between the estimated end index 634 and the actual end index 604. Processing the additional audio data within the padded portion of the keyword buffer may increase the latency of the second keyword detection level. For example, an over-padded keyword buffer may have a padded KW start time earlier than the actual, ground-truth start time 602 of the keyword within audio sample 600. The second keyword detection level may begin processing from the start of the padded keyword buffer (e.g., the earliest audio data within the keyword buffer, corresponding to a timestamp before the actual KW start time 602). Using the second keyword detection level to process “invalid” audio data that occurs before the true start time 602 of the keyword may be inefficient and increase the overall latency of the multi-level keyword detection system.
[0096] In other examples, the configured static values used to pad the keyword buffer start and / or end indices may be less than the actual error between the estimated and actual KW start indices 632 and 602 and / or less than the actual error between the estimated and actual KW end indices 634 and 604. In some cases, the estimated KW start index 632 may be more likely to deviate significantly from the ground-truth KW start index 602, and the padded keyword buffer may not include all audio data within audio sample 600 that corresponds to (e.g., lies between) the actual keyword start index 602 and the actual keyword end index 604.
[0097] For example, if the static value configured to populate the keyword buffer is less than the error in at least the keyword start index estimation (e.g., estimated KW start 632 – actual KW start 602), the populated keyword buffer will only include a partial representation of the spoken keywords within audio sample 600 (e.g., the beginnings of the spoken keywords within audio sample 600 are cut off from the audio data written to the keyword buffer and processed by the second keyword detection level). This incomplete representation of the spoken keyword audio data in the keyword buffer provided to the second keyword detection level may cause the second keyword detection level to incorrectly reject keyword detection at the first level (e.g., partial phrase rejection based on the truncation of the keyword audio data within the keyword buffer). Partial phrase rejection and / or rejection of first-level keyword detection may degrade keyword detection performance and harm the user experience, as the keyword will not be recognized within the current audio sample 600 and must be spoken again in the future to trigger the desired action.
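The two failure modes of static padding described above can be illustrated with a minimal sketch. The function names `pad_static` and `diagnose`, and all timestamps used, are hypothetical and chosen only for illustration; they are not values from Figure 6.

```python
def pad_static(est_start, est_end, pad):
    """Pad an estimated keyword window by a fixed amount on both sides (seconds)."""
    return max(0.0, est_start - pad), est_end + pad


def diagnose(buffer_start, true_start):
    """Positive: seconds of non-keyword audio before the keyword.
    Negative: seconds of keyword audio missing from the buffer."""
    return true_start - buffer_start


# Hypothetical ground-truth and estimated times, in seconds.
true_start, true_end = 1.00, 1.90
est_start, est_end = 1.20, 1.90   # the start estimate is 0.20 s late

for pad in (0.50, 0.10):
    buf_start, buf_end = pad_static(est_start, est_end, pad)
    slack = diagnose(buf_start, true_start)
    if slack >= 0:
        print(f"pad={pad:.2f}s: {slack:.2f}s of non-keyword audio precedes the keyword")
    else:
        print(f"pad={pad:.2f}s: first {-slack:.2f}s of the keyword is cut off")
```

With the larger padding value the second level would process audio that precedes the keyword; with the smaller one the beginning of the keyword never reaches the buffer.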
[0098] In one illustrative example, the system and techniques described herein can be used to achieve improved keyword detection (e.g., including multi-level keyword detection) based on refinement using estimated keyword lengths derived from speech rate classification information. For example, the estimated keyword start index (e.g., such as estimated KW start index 632) and / or estimated keyword end index (e.g., such as estimated KW end index 634) can be fine-tuned or refined to obtain keyword buffer audio data corresponding to the speech rate associated with spoken keywords within audio samples 600.
[0099] Figure 7 is a diagram illustrating various speech rates 700 that can be associated with, and / or used to categorize, spoken keywords within an input audio sample. For example, the first audio sample 710 corresponds to an example of a keyword spoken by a fast speaker, with the spoken keyword being approximately 0.585 seconds long.
[0100] The second audio sample 720 corresponds to an example of the same keyword spoken by a speaker at a normal speaking speed, with the spoken keyword length being approximately 0.914 seconds.
[0101] The third audio sample 730 corresponds to an example of the same keyword spoken by a slow-speed speaker, with the spoken keyword being approximately 1.226 seconds long.
[0102] In some cases, the estimated length information for a specific keyword may differ for different speakers, individuals, users, etc. For example, audio samples 710, 720, and 730 all correspond to the same spoken keyword, but the difference in spoken keyword length between a fast speaker (e.g., audio sample 710) and a slow speaker (e.g., audio sample 730) is approximately 0.641 seconds. In other words, the spoken keyword length of the slow speaker in audio sample 730 is more than 100% longer than the spoken keyword length of the same keyword uttered by the fast speaker in audio sample 710.
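The figures quoted for audio samples 710, 720, and 730 can be checked with a short calculation. The variable names below are illustrative only.

```python
# Spoken keyword lengths quoted for Figure 7, in seconds.
lengths = {"fast": 0.585, "normal": 0.914, "slow": 1.226}

diff = lengths["slow"] - lengths["fast"]   # absolute difference in seconds
relative = diff / lengths["fast"]          # difference relative to the fast speaker

print(f"slow - fast = {diff:.3f} s")
print(f"slow is {relative:.1%} longer than fast")
```

The relative difference works out to roughly 110%, which is consistent with the statement that the slow utterance is more than 100% longer than the fast one.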
[0103] Figure 8 is a block diagram illustrating an example keyword detection system 800 that uses keyword length refinement based on speech rate classification information, according to some examples.
[0104] Keyword detection system 800 can be used to process and perform keyword detection on audio data corresponding to input speech 802 (e.g., also referred to as "input audio sample 802"). Keyword detection and start estimation system 820 may include keyword detection machine learning network 812 (e.g., a first neural network, etc.) and keyword start estimation machine learning network 816 (e.g., a second neural network, etc.). In some aspects, keyword detection and start estimation system 820 may be implemented as or by a first keyword detection level (e.g., and keyword detection system 800 may be included in or provided as part of a multi-level keyword detection system). For example, keyword detection and start estimation system 820 may be the same as or similar to the first keyword detection level 200 of Figure 2.
[0105] The keyword detection network 812 can be used to process the input audio sample 802 and perform initial keyword detection indicating the specific keywords detected within the input audio sample 802. The specific keywords detected by the keyword detection network 812 may be configuration keywords associated with the keyword detection system 800. In some examples, the specific keywords detected by the keyword detection network 812 may be included in one or more configuration keywords among those detected by the keyword detection system 800.
[0106] Based on the detection of specific keywords within the input audio sample 802 using the keyword detection network 812 and / or the keyword detection and start estimation system 820, the keyword start estimation network 816 can be used to determine the estimated start time of the detected keywords within the input audio sample. For example, the keyword start estimation network 816 can generate an estimated keyword start index 825, which may be an estimated keyword start time index that is the same as or similar to the estimated keyword start index 632 of Figure 6.
[0107] In some cases, the estimated keyword end time index associated with the estimated KW start index 825 can be the same as or similar to the estimated keyword end index 634 of Figure 6. In some examples, the keyword start estimation network 816 can be used to generate an estimated keyword start time index including the estimated KW start index 825, and the time of the keyword detection output generated by the keyword detection network 812 can be used as the estimated keyword end time index associated with the estimated KW start index 825.
[0108] Based on the detection of specific keywords within input audio sample 802 (e.g., using keyword detection network 812), these systems and techniques can use speech rate classification network 830 to process input audio sample 802 in parallel to determine speech rate information 835 indicating the speech rate (e.g., fast, normal, slow, etc.) of spoken keywords within input audio sample 802. In some cases, speech rate classification network 830 may be a trained machine learning classification network used to classify input audio samples (e.g., such as input audio sample 802) into one of several different speech rate classifications, such as fast speech rate / speaker, normal speech rate / speaker, slow speech rate / speaker, etc.
[0109] The speech rate classification network 830 can analyze the input audio sample 802 in parallel with the keyword detection and start estimation system 820. In an illustrative example, the speech rate classification network 830 can analyze the input audio sample 802 to generate speech rate information 835 in parallel with the generation of the estimated keyword start index 825 by the keyword detection and start estimation system 820 (e.g., a first keyword detection level). For example, the keyword detection network 812 can be configured to run continuously to perform always-on keyword detection.
[0110] A keyword detection by the keyword detection network 812 in the first keyword processing stage 820 can trigger the speech rate classification network 830 and the keyword start estimation network 816 to start processing the input audio sample 802, wherein the processing of the input audio sample 802 by the network 830 and the network 816 can be performed in parallel.
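The triggering relationship described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the three functions are hypothetical stand-ins for networks 812, 816, and 830, their return values are placeholders, and a thread pool is only one assumed way of running the two triggered networks in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_keyword(audio):
    """Stand-in for keyword detection network 812: returns a keyword or None."""
    return "hello device" if audio else None


def estimate_start(audio):
    """Stand-in for keyword start estimation network 816: returns a start index."""
    return 0


def classify_speech_rate(audio):
    """Stand-in for speech rate classification network 830."""
    return "normal"


def first_stage(audio):
    keyword = detect_keyword(audio)   # the always-on detection step
    if keyword is None:
        return None                   # the other two networks stay idle
    # A detection triggers both downstream networks on the same audio sample.
    with ThreadPoolExecutor(max_workers=2) as pool:
        start_future = pool.submit(estimate_start, audio)
        rate_future = pool.submit(classify_speech_rate, audio)
        return keyword, start_future.result(), rate_future.result()
```

Only `detect_keyword` runs on every sample in this sketch; the start estimate and the speech rate class are computed only after a detection.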
[0111] In some aspects, the keyword detection system 800 may include a keyword index refinement engine 840, which is configured to generate refined keyword start and end indexes 845 based on an estimated keyword start index 825, speech rate classification information 835, and further based on an average keyword length value 842.
[0112] The average keyword length value 842 may correspond to a specific keyword detected as a spoken word within the input audio sample 802 (e.g., the average keyword length value 842 corresponds to a specific keyword detected by the keyword detection network 812). The average keyword length value 842 may include the average spoken length of a specific keyword for each corresponding speech rate category included in the speech rate classification output space of the speech rate classification network 830.
[0113] For example, if the possible output classifications of different speech rates indicated by the speech rate classification network 830 in the speech rate information 835 include {fast; normal; slow}, then the average keyword length value 842 for a specific keyword detected by the keyword detection network 812 within the input audio sample 802 may include {average speech length KW for fast speech rate classification; average speech length KW for normal speech rate classification; average speech length KW for slow speech rate classification}.
[0114] In some aspects, an average keyword length value 842 may be determined for each corresponding speech rate classification and for each corresponding keyword among a plurality of keywords that the keyword detection system 800 is configured to detect and / or identify. The average keyword length value 842 may be determined using one or more offline datasets comprising multiple samples of multiple keywords uttered by speakers with different corresponding speech rate classifications.
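One way to picture the average keyword length value 842 is as a lookup table keyed by keyword and speech rate classification, computed once from offline data. The sketch below assumes such a layout; the function name `build_average_lengths`, the tuple format, the keyword, and the measurements are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_average_lengths(samples):
    """Average spoken length per keyword and per speech rate class.

    `samples` is an iterable of (keyword, rate_class, length_seconds) tuples.
    """
    grouped = defaultdict(list)
    for keyword, rate_class, length in samples:
        grouped[(keyword, rate_class)].append(length)

    table = defaultdict(dict)
    for (keyword, rate_class), lengths in grouped.items():
        table[keyword][rate_class] = mean(lengths)
    return {kw: dict(by_rate) for kw, by_rate in table.items()}


# Hypothetical offline measurements for one keyword, in seconds.
offline = [
    ("hey device", "fast", 0.50), ("hey device", "fast", 0.75),
    ("hey device", "normal", 1.00),
    ("hey device", "slow", 1.25), ("hey device", "slow", 1.75),
]
avg_lengths = build_average_lengths(offline)
```

At run time, a detected keyword and a speech rate class would then select a single average length, for example `avg_lengths["hey device"]["slow"]`.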
[0115] In an illustrative example, the average keyword length value 842 may be embedded in a machine learning model (e.g., a neural network model) used to implement the keyword index refinement engine 840 and / or to implement the speech rate classification network 830. For example, the average keyword length value 842 may be embedded in model metadata used to configure or initialize the underlying trained machine learning model (e.g., a trained neural network model) associated with the keyword detection system 800.
[0116] Based on speech rate information 835 (indicating whether the speaker of the spoken keywords detected in the input audio sample 802 is a fast, normal, or slow speaker), the keyword index refinement engine 840 can generate refined keyword start and end indices 845 by adjusting (e.g., padding) one or more (or both) of the estimated KW start index 825 and / or the estimated keyword end index associated with the keyword start index 825. For example, the adjustment or padding of the estimated KW start and end indices can be based on the detected speech rate classification 835 determined by the speech rate classification network 830.
[0117] For example, Figure 9 is a block diagram 900 illustrating an example of a speech rate classification network 930 and a keyword index refinement engine 940 that can be used to implement the keyword detection system 800 of Figure 8, according to some examples. In some aspects, the speech rate classification network 930 can be the same as or similar to the speech rate classification network 830 of Figure 8. In some cases, the keyword index refinement engine 940 can be the same as or similar to the keyword index refinement engine 840 of Figure 8. In some examples, the average keyword length value 942 can be the same as or similar to the average keyword length value 842 of Figure 8. In some cases, the estimated keyword start index 925 can be the same as or similar to the estimated keyword start index 825 of Figure 8.
[0118] The average keyword length value 942 can be an average keyword length determined offline for the specific detected keyword (e.g., the keyword associated with the estimated keyword start and end indices 925), for each speech rate classification in the classification space of the speech rate classification network.
[0119] For example, in one instance, if the speech rate classification network 930 includes an output classification space for slow speech rate (e.g., corresponding to processing flow 940-1 within the keyword index refinement engine 940), normal speech rate (e.g., corresponding to processing flow 940-2 within the keyword index refinement engine 940), and fast speech rate (e.g., corresponding to processing flow 940-3 within the keyword index refinement engine 940), then the average keyword length value 942 can correspond to the average slow, normal, and fast speech lengths of the detected keywords.
[0120] In an illustrative example, the average keyword length value 942 may include an offline estimated average keyword length of a slow speaker uttering the specific detected keyword, denoted as L_s; an offline estimated average keyword length of a normal speaker uttering the specific detected keyword, denoted as L_n; and an offline estimated average keyword length of a fast speaker uttering the specific detected keyword, denoted as L_f.
[0121] For example, the average keyword length L_s of slow speakers can be the same as or similar to the keyword length of the slow speaker 730 of Figure 7. The average keyword length L_n of normal speakers can be the same as or similar to the keyword length of the normal speaker 720 of Figure 7. The average keyword length L_f of fast speakers can be the same as or similar to the keyword length of the fast speaker 710 of Figure 7.
[0122] In some aspects, the keyword index refinement engine 940 of Figure 9 can receive, as input, the estimated keyword start and end indices 925 of the currently detected keyword (e.g., the same keyword corresponding to the average keyword length value 942, which can be embedded in the model metadata of the keyword index refinement engine 940 and / or the speech rate classification network 930). The estimated keyword start and end indices 925 can be obtained from the first keyword detection level, such as the keyword detection and start estimation system 820 of Figure 8, the first keyword detection level 200 of Figure 2, etc.
[0123] The keyword index refinement engine 940 can determine the estimated length of the detected spoken keyword from the estimated keyword start and end indices 925 provided by the first keyword detection level. For example, the keyword index refinement engine 940 can calculate the estimated spoken keyword length as L_est = (estimated KW end timestamp) - (estimated KW start timestamp). The estimated KW end timestamp and the estimated KW start timestamp can be the same as the estimated KW start and end indices 925.
[0124] Based on the speech rate classification network 930 generating speech rate classification information indicating a slow speech rate (e.g., such as the speech rate classification information 835 of Figure 8), the keyword index refinement engine 940 can use processing flow 940-1 to refine or fine-tune the estimated KW start and end indices 925 based on the average slow-speaker keyword length L_s included in the average keyword length information 942 embedded in the model metadata. For example, the keyword index refinement engine 940 can calculate the refined keyword length for a slow speech rate as max{L_est, L_s}. The estimated KW start and / or estimated KW end index 925 can be adjusted using this refined keyword length. For example, if the estimated spoken keyword length L_est determined from the initial KW start and end indices 925 is longer than the average length L_s of keywords spoken by slow speakers, then the initial estimated KW start and end indices 925 can be used as the refined KW start and end indices output by the keyword index refinement engine 940 (e.g., the same as or similar to the refined KW start and end indices 845 output by the keyword index refinement engine 840 of Figure 8). If the estimated spoken keyword length L_est from the initial index estimate 925 is shorter than the average length L_s of keywords spoken by slow speakers, then the initial estimated KW start and end indices 925 can be adjusted such that (refined KW end index) - (refined KW start index) = L_s.
[0125] In some examples, the refined KW start and end indices can each be offset from their respective initial KW start and end indices 925 by an equal amount of time. For example, the refined KW start index can be offset to be earlier than the initial KW start index 925 by an amount equal to ..., and the refined KW end index can be offset to be later than the initial KW end index 925 by an amount also equal to ... (e.g., such that the total length of the refined keyword increases by L_s - L_est and the refined keyword length is L_s). In some cases, the estimated KW start and end indices 925 can be adjusted disproportionately or unequally to generate the corresponding refined KW start and end indices. For example, the refined KW start index can be shifted to be earlier than the initial KW start index 925 by an amount equal to ..., and the refined KW end index can be offset to be later than the initial KW end index 925 by an amount equal to ... (e.g., such that the total length of the refined keyword increases by L_s - L_est and the refined keyword length is L_s). In an illustrative example, the weighting parameter N can be between 0.5 and 1 (e.g., because the initial KW end index estimate may be more accurate than the initial KW start index estimate of the indices 925).
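The refinement for a slow speech rate can be sketched as below. The exact offset expressions are not reproduced in the text above, so the split used here is an assumption: a fraction n of the added length L_s - L_est is placed before the start and the remaining 1 - n after the end, with n = 0.5 giving the equal split. That choice is consistent with the stated total increase and with a weighting parameter between 0.5 and 1, but it is not confirmed by the source. The function name `refine_indices` is hypothetical.

```python
def refine_indices(est_start, est_end, avg_len, n=0.5):
    """Refine estimated keyword start/end times (seconds).

    avg_len: average spoken keyword length for the classified speech rate
             (L_s for a slow speaker).
    n:       assumed share of the added length applied before the start;
             n = 0.5 is the equal split.
    """
    if not 0.0 <= n <= 1.0:
        raise ValueError("n must be in [0, 1]")
    est_len = est_end - est_start         # L_est
    target_len = max(est_len, avg_len)    # max{L_est, L_s}
    extra = target_len - est_len          # 0 when L_est >= L_s
    refined_start = est_start - n * extra
    refined_end = est_end + (1.0 - n) * extra
    return refined_start, refined_end
```

For example, `refine_indices(2.0, 3.0, 1.5)` widens a 1.0 s estimate to 1.5 s by moving each index 0.25 s outward, while `refine_indices(2.0, 4.0, 1.5)` leaves the indices unchanged because the estimate is already longer than the average. The sketch does not clamp the refined start to the beginning of the audio sample.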
[0126] In another example, based on the speech rate classification network 930 generating speech rate classification information indicating a normal speech rate (e.g., such as the speech rate classification information 835 of Figure 8), the keyword index refinement engine 940 can use processing flow 940-2 to refine or fine-tune the estimated KW start and end indices 925 based on the average normal-speaker keyword length L_n included in the average keyword length information 942 embedded in the model metadata. For example, the keyword index refinement engine 940 can calculate the refined keyword length for a normal speech rate as max{L_est, L_n}. The estimated KW start and / or estimated KW end index 925 can be adjusted using this refined keyword length. For example, if the estimated spoken keyword length L_est determined from the initial KW start and end indices 925 is longer than the average length L_n of keywords spoken by normal speakers, then the initial estimated KW start and end indices 925 can be used as the refined KW start and end indices output by the keyword index refinement engine 940 (e.g., the same as or similar to the refined KW start and end indices 845 output by the keyword index refinement engine 840 of Figure 8). If the estimated spoken keyword length L_est from the initial KW index estimate 925 is shorter than the average length L_n of keywords spoken by normal speakers, then the initial estimated KW start and end indices 925 can be adjusted such that (refined KW end index) - (refined KW start index) = L_n.
[0127] In some examples, the refined KW start and end indices can each be offset from their respective initial KW start and end indices 925 by an equal amount of time. For example, the refined KW start index can be offset to be earlier than the initial KW start index 925 by an amount equal to ..., and the refined KW end index can be offset to be later than the initial KW end index 925 by an amount also equal to ... (e.g., such that the total length of the refined keyword increases by L_n - L_est and the refined keyword length is L_n). In some cases, the estimated KW start and end indices 925 can be adjusted disproportionately or unequally to generate the corresponding refined KW start and end indices. For example, the refined KW start index can be shifted to be earlier than the initial KW start index 925 by an amount equal to ..., and the refined KW end index can be offset to be later than the initial KW end index 925 by an amount equal to ... (e.g., such that the total length of the refined keyword increases by L_n - L_est and the refined keyword length is L_n). In an illustrative example, the weighting parameter N can be between 0.5 and 1 (e.g., because the initial KW end index estimate may be more accurate than the initial KW start index estimate of the indices 925).
[0128] In another illustrative example, based on the speech rate classification network 930 generating speech rate classification information indicating a fast speech rate (e.g., such as the speech rate classification information 835 of Figure 8), the keyword index refinement engine 940 can use processing flow 940-3 to refine or fine-tune the estimated KW start and end indices 925 based on the average fast-speaker keyword length L_f included in the average keyword length information 942 embedded in the model metadata. For example, the keyword index refinement engine 940 can calculate the refined keyword length for a fast speech rate as max{L_est, L_f}. The estimated KW start and / or estimated KW end index 925 can be adjusted using this refined keyword length. For example, if the estimated spoken keyword length L_est determined from the initial KW start and end indices 925 is longer than the average length L_f of keywords spoken by fast speakers, then the initial estimated KW start and end indices 925 can be used as the refined KW start and end indices output by the keyword index refinement engine 940 (e.g., the same as or similar to the refined KW start and end indices 845 output by the keyword index refinement engine 840 of Figure 8). If the estimated spoken keyword length L_est from the initial index estimate 925 is shorter than the average length L_f of keywords spoken by fast speakers, then the initial estimated KW start and end indices 925 can be adjusted such that (refined KW end index) - (refined KW start index) = L_f.
[0129] In some examples, the refined KW start and end indices can each be offset from their respective initial KW start and end indices 925 by an equal amount of time. For example, the refined KW start index can be offset to be earlier than the initial KW start index 925 by an amount equal to ..., and the refined KW end index can be offset to be later than the initial KW end index 925 by an amount also equal to ... (e.g., such that the total length of the refined keyword increases by L_f - L_est and the refined keyword length is L_f). In some cases, the estimated KW start and end indices 925 can be adjusted disproportionately or unequally to generate the corresponding refined KW start and end indices. For example, the refined KW start index can be shifted to be earlier than the initial KW start index 925 by an amount equal to ..., and the refined KW end index can be offset to be later than the initial KW end index 925 by an amount equal to ... (e.g., such that the total length of the refined keyword increases by L_f - L_est and the refined keyword length is L_f). In an illustrative example, the weighting parameter N can be between 0.5 and 1 (e.g., because the initial KW end index estimate may be more accurate than the initial KW start index estimate of the indices 925).
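The three processing flows 940-1, 940-2, and 940-3 differ only in which average length is compared against L_est, so they can be viewed as one table lookup followed by a max. The sketch below assumes that view. The table reuses the Figure 7 example lengths as stand-ins for L_s, L_n, and L_f, and the value of `est_len` is hypothetical.

```python
# Assumed per-class average lengths for one keyword, in seconds. The values
# reuse the Figure 7 example lengths purely for illustration.
AVG_LEN = {"slow": 1.226, "normal": 0.914, "fast": 0.585}


def refined_length(est_len, rate_class, avg_len=AVG_LEN):
    """Return max{L_est, L_class} for the classified speech rate."""
    return max(est_len, avg_len[rate_class])


est_len = 0.80  # hypothetical L_est from the first keyword detection level
for rate_class in ("slow", "normal", "fast"):
    print(rate_class, refined_length(est_len, rate_class))
```

With the same 0.80 s estimate, a slow classification extends the keyword window the most, a normal classification extends it slightly, and a fast classification leaves it unchanged.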
[0130] In some aspects, the systems and techniques described herein can be used to perform keyword length refinement (e.g., keyword start and / or end index refinement) based on speech rate classification information from the initial audio samples in which specific keywords were detected. The refined keyword start and end indices (e.g., such as the refined KW start and end indices 845 of Figure 8, generated using the keyword index refinement engine 840 of Figure 8) can be used to extend the length of estimated keywords to more closely correspond to the actual, ground-truth keyword start and end time indices. For example, the refined KW start and end indices 845 of Figure 8 can correspond to a refinement that shifts the initial estimated KW start and end indices 632 and 634 of Figure 6 closer to the corresponding ground-truth KW start and end indices 602 and 604 of Figure 6.
[0131] In some aspects, refining or fine-tuning the estimated KW start and end indices (e.g., the estimated KW start index 825 of Figure 8, the estimated KW start index 925 of Figure 9, etc.) can reduce keyword truncation in the keyword buffer information, representing detected keywords, that is passed from the first keyword detection level to the second keyword detection level in a multi-level keyword detection system. The overall performance of the keyword detection system can be improved based on adding the speech rate classification network 830 of Figure 8 and / or the speech rate classification network 930 of Figure 9, which can be configured to run only when the first keyword detection level (e.g., the KW detection network 812 of Figure 8) has detected a specific keyword in the input audio sample being analyzed.
[0132] The systems and techniques described herein for keyword length refinement can also be used to reduce redundant audio data buffering in multi-level keyword detection systems, whether before or after the estimated keyword. The refined KW start and end indices 845 can be shifted closer to the actual, ground-truth KW start and end indices of the spoken keywords detected within the current audio sample, and using the refined KW start and end indices in subsequent keyword detection and / or keyword processing stages of a multi-level keyword detection system can reduce the processing time and power associated with keyword detection.
[0133] Figure 10 is a block diagram illustrating a process 1000 for processing one or more audio samples. Although the example process 1000 depicts a specific order of operations, this order may be varied without departing from the scope of this disclosure. For example, some of the depicted operations may be performed in parallel or in a different order that does not substantially affect the functionality of process 1000. In other examples, different components of the example device or system implementing process 1000 may perform their functions substantially simultaneously or in a specific order. In some examples, process 1000 may be performed by the computing device 100 of Figure 1, the keyword detection system 200 of Figure 2, the keyword detection system 800 of Figure 8, the keyword detection architecture 900 of Figure 9, etc.
[0134] At block 1002, process 1000 includes detecting, using a first keyword detection model, a spoken keyword within an audio sample of one or more audio samples. For example, the first keyword detection model may be the same as or similar to one or more of the following: the keyword detection model 212 and / or the first keyword detection level 200 of Figure 2; the keyword detection network 812 of Figure 8; the keyword detection and start estimation system 820 of Figure 8; etc.
[0135] In some cases, the audio samples can be obtained from the audio source 202 of Figure 2, and / or can be the same as or similar to the audio signals of Figure 3. In some cases, the audio samples may be the same as or similar to the input speech 802 of Figure 8. In some cases, the one or more audio samples may be included in the input speech 802 of Figure 8.
[0136] In some examples, the first keyword detection model is configured to perform always-on keyword detection on the one or more audio samples. In some cases, the first keyword detection model may be associated with a speech rate classification machine learning network. For example, the speech rate classification machine learning network may be the same as or similar to the speech rate classification network 830 of Figure 8, the speech rate classification network 930 of Figure 9, etc. In some cases, the speech rate classification machine learning network is configured to perform speech rate classification on a specific audio sample of the one or more audio samples based on the first keyword detection model detecting a spoken keyword within that specific audio sample. For example, the speech rate classification machine learning network 830 of Figure 8 can be configured to perform speech rate classification on a specific audio sample of the input speech 802 based on the first keyword detection model 812 of Figure 8 detecting a spoken keyword in that audio sample.
[0137] At block 1004, process 1000 includes determining an estimated keyword index corresponding to the detection of the spoken keyword within the audio sample, the estimated keyword index including an estimated keyword start index and an estimated keyword end index.
[0138] For example, the estimated keyword index can be determined using the keyword detection and start estimation system 820 of Figure 8 and / or using the keyword start estimation network 816 of Figure 8. In some examples, the estimated keyword index can be determined using the first keyword detection level 200 of Figure 2. In some cases, the estimated keyword index may include an estimated keyword start index that is the same as or similar to the estimated keyword start index 632 of Figure 6, and / or may include an estimated keyword end index that is the same as or similar to the estimated keyword end index 634 of Figure 6.
[0139] At block 1006, process 1000 includes using a speech rate classification machine learning network to determine speech rate information corresponding to the audio sample. For example, the speech rate classification machine learning network can be the same as or similar to the speech rate classification network 830 of Figure 8, the speech rate classification network 930 of Figure 9, etc.
[0140] In some examples, the speech rate information can be the same as or similar to the speech rate information 835 of Figure 8, and can be determined using the speech rate classification network 830 of Figure 8. In some examples, the speech rate information can be the same as or similar to the speech rate information 940 of Figure 9. In some cases, the speech rate information can be selected from among a slow speech rate classification (e.g., the same as or similar to the slow speech rate information 940-1 of Figure 9), a normal speech rate classification (e.g., the same as or similar to the normal speech rate information 940-2 of Figure 9), or a fast speech rate classification (e.g., the same as or similar to the fast speech rate information 940-3 of Figure 9).
[0141] In some cases, the speech rate information indicates a slow speech rate classification of the spoken keyword within the audio sample (e.g., the same as or similar to the slow speech rate 940-1 of Figure 9), a normal speech rate classification (e.g., the same as or similar to the normal speech rate 940-2 of Figure 9), or a fast speech rate classification (e.g., the same as or similar to the fast speech rate 940-3 of Figure 9).
[0142] In some examples, the speech rate information and the estimated keyword start index can be determined in parallel. In some cases, the speech rate information can be determined using the speech rate classification machine learning network in response to the detection of the spoken keyword (e.g., the spoken keyword detected by a first keyword detection machine learning network). The estimated keyword start index (e.g., 825 of Figure 8, 925 of Figure 9, 632 of Figure 6, etc.) can be determined using a keyword start estimation neural network in response to the detection of the spoken keyword. In some examples, the keyword start estimation neural network can be the same as or similar to the keyword start estimation network 816 of Figure 8, and may be included in the same keyword detection and start estimation system 820 as the first keyword detection machine learning network 812 used to perform keyword detection.
[0143] For example, the first keyword detection model and the keyword start estimation neural network can be included in the first keyword detection level of a multi-level keyword detection system. The first keyword detection level can be the same as or similar to the first keyword detection level 200 of Figure 2, and / or can be the same as or similar to the keyword detection and start estimation system 820 of Figure 8 and / or the first keyword detection level of the keyword detection system 800 of Figure 8.
[0144] At box 1008, process 1000 includes obtaining an average spoken length value corresponding to the spoken keyword and the speech rate information. For example, the average spoken length value may be the same as or similar to the average keyword length value 842 of Figure 8 and / or the average keyword length value 942 of Figure 9.
[0145] In some cases, the average spoken length value is included in average keyword length information corresponding to the spoken keyword (e.g., the average keyword length information 842 of Figure 8 and / or the average keyword length information 942 of Figure 9). In some examples, the average keyword length information includes a corresponding average spoken length value for each of the multiple speech rate classifications associated with the speech rate classification machine learning network. For example, the corresponding average spoken length values may include the average spoken length value L_s for the slow speech rate 940-1 in 942 of Figure 9, the average spoken length value L_n for the normal speech rate 940-2 in 942 of Figure 9, and / or the average spoken length value L_f for the fast speech rate 940-3 in 942 of Figure 9.
[0146] In some cases, the average keyword length information includes an offline estimate of each corresponding average spoken length value. For example, the average spoken length value L_s for the slow speech rate 940-1 in 942 of Figure 9 can correspond to the offline estimate 730 for slow speakers in Figure 7. The average spoken length value L_n for the normal speech rate 940-2 in 942 of Figure 9 can correspond to the offline estimate 720 for normal speakers in Figure 7. The average spoken length value L_f for the fast speech rate 940-3 in 942 of Figure 9 can correspond to the offline estimate 710 for fast speakers in Figure 7.
[0147] In some cases, each corresponding average spoken length value included in the average keyword length information is embedded in machine learning model metadata associated with the configuration of the speech rate classification machine learning network or the configuration of the keyword index refinement machine learning network used to generate the refined keyword index. For example, the average keyword spoken length value 842 of Figure 8 can be embedded in the machine learning model metadata associated with the configuration of the speech rate classification network 830 of Figure 8, and / or embedded in the machine learning model metadata associated with the configuration of the keyword index refinement machine learning network (e.g., engine) 840 of Figure 8.
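As an illustration of [0145] to [0147], the per-classification average spoken length values could be stored alongside the model configuration and looked up by speech rate classification. The sketch below assumes a JSON metadata layout; the field names and the millisecond values are hypothetical placeholders and are not taken from this disclosure.

```python
import json

# Hypothetical model metadata. The keys and the millisecond values are
# illustrative placeholders for the offline estimates L_s, L_n, and L_f.
MODEL_METADATA_JSON = """
{
  "model": "speech_rate_classifier",
  "average_keyword_length_ms": {"slow": 1100, "normal": 800, "fast": 600}
}
"""


def average_spoken_length(metadata_json, speech_rate_class):
    """Return the offline average spoken length for one speech rate class."""
    table = json.loads(metadata_json)["average_keyword_length_ms"]
    if speech_rate_class not in table:
        raise ValueError(f"unknown speech rate class: {speech_rate_class!r}")
    return table[speech_rate_class]
```

Keeping the values in metadata rather than in code means a retrained model can ship updated offline estimates without a firmware change, which is one plausible reason for embedding them as described.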
[0148] At box 1010, process 1000 includes generating a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword starting index offset to a time earlier than the estimated keyword starting index and a refined keyword ending index offset to a time later than the estimated keyword ending index. For example, the refined keyword index may be the same as or similar to the refined keyword index 845 of Figure 8. In some examples, the refined keyword index includes a refined keyword starting index that corresponds to the estimated keyword starting index 825 determined by the keyword starting estimation network 816 of Figure 8. In some cases, the refined keyword starting index may be the same as or similar to the actual or ground-truth keyword starting index, such as the actual (ground-truth) keyword starting index 602 of Figure 6 (e.g., as compared with the estimated keyword starting index 632 of Figure 6). In some examples, the refined keyword ending index may be the same as or similar to the actual or ground-truth keyword ending index, such as the actual (ground-truth) keyword ending index 604 of Figure 6 (e.g., as compared with the estimated keyword ending index 634 of Figure 6).
[0149] In some cases, generating the refined keyword index can include determining an estimated length of the spoken keyword based on the difference between the estimated keyword ending index and the estimated keyword starting index. For example, the estimated length may correspond to the difference between the estimated keyword ending index 634 of Figure 6 and the estimated keyword starting index 632 of Figure 6.
[0150] In some examples, the estimated length of the spoken keyword can be compared with the average spoken keyword length to determine a refined length of the spoken keyword. For example, this comparison can be performed using the speech rate classification network 840 of Figure 8 and / or the speech rate classification network 930 of Figure 9. For example, the estimated length of the spoken keyword can be the same as or similar to the estimated keyword length L_est of Figure 9, and the comparison can be made between L_est and L_s for the slow speech rate 940-1, between L_est and L_n for the normal speech rate 940-2, or between L_est and L_f for the fast speech rate 940-3. In some cases, the refined keyword index (e.g., the refined keyword index 845 of Figure 8) can be generated based on the refined length of the spoken keyword.
[0151] In some cases, the refined keyword index can be generated by determining the refined keyword starting index as a time index that is shifted earlier in time by a first amount relative to the estimated keyword starting index, where the first amount corresponds to the difference between the refined length and the estimated length of the spoken keyword. The refined keyword ending index can be determined as a time index that is shifted later in time by a second amount relative to the estimated keyword ending index, where the second amount corresponds to the difference between the refined length and the estimated length of the spoken keyword.
[0152] In some cases, the first amount and the second amount are the same. In some examples, the first amount includes a first percentage of the difference between the refined length and the estimated length of the spoken keyword. In some cases, the second amount includes a second percentage of the difference between the refined length and the estimated length of the spoken keyword. In some examples, the first percentage is greater than the second percentage. In some cases, the first percentage is greater than 50%, and the sum of the first percentage and the second percentage equals 100%.
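The refinement in [0149] to [0152] can be sketched as follows. The sketch rests on two labeled assumptions that the disclosure does not fix: the refined length is taken as the larger of the estimated length and the average spoken length for the detected speech rate classification, and indices are integer sample or frame counts. The names `KeywordIndex`, `refine_keyword_index`, and `start_fraction` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordIndex:
    start: int  # keyword starting index (samples or frames)
    end: int    # keyword ending index (samples or frames)


def refine_keyword_index(estimated, average_length, start_fraction=0.5):
    """Widen an estimated keyword index toward the average spoken length.

    start_fraction is the first percentage (as a fraction) of the length
    difference applied to the starting index; the remainder is applied to
    the ending index, so the two percentages sum to 100%.
    """
    if not 0.0 <= start_fraction <= 1.0:
        raise ValueError("start_fraction must be between 0 and 1")

    estimated_length = estimated.end - estimated.start
    # Assumption: never shrink the keyword window, only widen it.
    refined_length = max(estimated_length, average_length)
    difference = refined_length - estimated_length

    first_amount = round(difference * start_fraction)  # shift start earlier
    second_amount = difference - first_amount          # shift end later
    return KeywordIndex(estimated.start - first_amount,
                        estimated.end + second_amount)
```

For instance, with an estimated index of (1000, 1800), an average length of 1000, and `start_fraction=0.6`, the difference of 200 is split into 120 and 80, giving a refined index of (880, 1880). A caller would still need to clamp the starting index to the beginning of the available audio buffer.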
[0153] In some cases, the apparatus for implementing process 1000 may include a microphone configured to acquire one or more audio samples. In some cases, the apparatus may also include one or more microphones configured to capture one or more audio samples for keyword detection. In some examples, the one or more microphones and the first keyword detection model are associated with an always-on keyword detection process implemented by the apparatus.
[0154] In some cases, the processes described herein (e.g., process 1000 and / or any other processes described herein) may be performed by a computing device or apparatus. In one example, process 1000 and / or other techniques or processes described herein may be performed by the system of Figure 8 and / or Figure 9. In another example, process 1000 and / or other techniques or processes described herein may be performed by the computing system 1100 shown in Figure 11. For example, a computing device having the computing device architecture of the computing system 1100 shown in Figure 11 can implement the operations of process 1000, and / or can implement the components and / or operations described herein with respect to any one or more of Figures 1 to 9.
[0155] In some cases, a computing device or apparatus may include various components such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and / or other components configured to perform the steps of the processes described herein. In some examples, a computing device may include a display, one or more network interfaces configured to transmit and / or receive data, any combination thereof, and / or other components. One or more network interfaces may be configured to transmit and / or receive wired and / or wireless data, including data according to 3G, 4G, 5G, and / or other cellular standards, data according to the WiFi (802.11x) standard, data according to the Bluetooth™ standard, data according to the Internet Protocol (IP) standard, and / or other types of data.
[0156] Components of a computing device may be implemented in circuitry. For example, components may include electronic circuitry or other electronic hardware, and / or may be implemented using electronic circuitry or other electronic hardware, which may include one or more programmable electronic circuits (e.g., a microprocessor, graphics processing unit (GPU), digital signal processor (DSP), central processing unit (CPU), and / or other suitable electronic circuitry), and / or may include computer software, firmware, or any combination thereof for performing the various operations described herein, and / or may be implemented using computer software, firmware, or any combination thereof for performing the various operations described herein.
[0157] Process 1000 is illustrated as a logic flowchart, whose operations represent a sequence of operations that can be implemented in hardware, computer instructions, or combinations thereof. In the context of computer instructions, each operation represents a computer-executable instruction stored on one or more computer-readable storage media that performs the described operation when executed by one or more processors. Generally, computer-executable instructions include routines, programs, objects, components, data structures, etc., that perform a specific function or implement a specific data type. The order in which the operations are described is not intended to be construed as limiting, and any number of described operations can be combined in any order and / or in parallel to implement the process.
[0158] Additionally, process 1000 and / or other processes described herein may be executed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) that executes jointly on one or more processors, implemented in hardware, or implemented in a combination thereof. As noted above, the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising multiple instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory.
[0159] Figure 11 is a diagram illustrating an example of a system for implementing certain aspects of this technology. Specifically, Figure 11 illustrates an example of computing system 1100, which can be, for example, any computing device constituting an internal computing system, a remote computing system, a camera, or any component thereof, wherein the components of the system communicate with each other using connection 1105. Connection 1105 can be a physical connection using a bus, or a direct connection to processor 1110, such as in a chipset architecture. Connection 1105 can also be a virtual connection, a networking connection, or a logical connection.
[0160] In some aspects, computing system 1100 is a distributed system in which the functions described herein can be distributed across a data center, multiple data centers, a peer-to-peer network, etc. In some aspects, one or more system components described represent a number of such components that each perform some or all of the functions described for that component. In some aspects, components can be physical or virtual devices.
[0161] Example system 1100 includes at least one processing unit (CPU or processor) 1110 and a connection 1105 that communicatively couples various system components, including system memory 1115 such as read-only memory (ROM) 1120 and random access memory (RAM) 1125, to the processor 1110. Computing system 1100 may include a cache 1115 of high-speed memory that is directly connected to, in close proximity to, or integrated into the processor 1110.
[0162] Processor 1110 may include any general-purpose processor and a hardware service or software service, such as services 1132, 1134, and 1136 stored in storage device 1130, configured to control processor 1110, as well as a special-purpose processor in which software instructions are incorporated into the actual processor design. Processor 1110 may essentially be a completely self-contained computing system containing multiple cores or processors, buses, memory controllers, caches, etc. Multi-core processors may be symmetric or asymmetric.
[0163] To enable user interaction, the computing system 1100 includes an input device 1145 that can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphic input, a keyboard, a mouse, motion input, speech, etc. The computing system 1100 may also include an output device 1135 that can be one or more of a plurality of output mechanisms. In some cases, a multimodal system allows the user to provide multiple types of input / output to communicate with the computing system 1100.
[0164] The computing system 1100 may include a communication interface 1140, which typically controls and manages user input and system output. The communication interface may perform or facilitate the receiving and / or transmitting of wired or wireless communications using wired and / or wireless transceivers, including those utilizing audio jacks / plugs, microphone jacks / plugs, Universal Serial Bus (USB) ports / plugs, Apple™ Lightning™ ports / plugs, Ethernet ports / plugs, fiber optic ports / plugs, dedicated wired ports / plugs, 3G, 4G, 5G and / or other cellular data network wireless signal transmission, Bluetooth™ wireless signal transmission, Bluetooth™ Low Energy (BLE) wireless signal transmission, iBeacon™ wireless signal transmission, radio frequency identification (RFID) wireless signal transmission, near field communication (NFC) wireless signal transmission, dedicated short range communication (DSRC) wireless signal transmission, 802.11 Wi-Fi wireless signal transmission, wireless local area network (WLAN) signal transmission, visible light communication (VLC), Worldwide Interoperability for Microwave Access (WiMAX), infrared (IR) wireless signal transmission, public switched telephone network (PSTN) signal transmission, integrated services digital network (ISDN) signal transmission, ad hoc network signal transmission, radio wave signal transmission, microwave signal transmission, infrared signal transmission, visible light signal transmission, ultraviolet light signal transmission, wireless signal transmission along the electromagnetic spectrum, or some combination thereof. The communication interface 1140 may also include one or more Global Navigation Satellite System (GNSS) receivers or transceivers for determining the location of the computing system 1100 based on one or more signals received from one or more satellites associated with one or more GNSS systems. GNSS systems include, but are not limited to, the U.S. Global Positioning System (GPS), Russia's Global Navigation Satellite System (GLONASS), China's BeiDou Navigation Satellite System (BDS), and Europe's Galileo GNSS. There is no restriction to operating on any particular hardware arrangement, and therefore the basic features described here can easily be substituted with improved hardware or firmware arrangements as they are developed.
[0165] Storage device 1130 may be a non-volatile and / or non-transitory and / or computer-readable storage device, and may be a hard disk or other type of computer-readable medium capable of storing data accessible by a computer, such as magnetic tape cassettes, flash memory cards, solid-state storage devices, digital versatile discs, cartridges, floppy disks, hard disks, magnetic tapes, magnetic stripes, any other magnetic storage media, flash memory, memristor memory, any other solid-state storage, CD-ROM, rewritable CD, digital video disc (DVD), Blu-ray disc (BDD), holographic disc, another optical medium, secure digital (SD) card, micro-secure digital (microSD) card, Memory Stick® cards, smart card chips, EMV chips, Subscriber Identity Module (SIM) cards, mini / micro / nano / pico SIM cards, another integrated circuit (IC) chip / card, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash EPROM, cache memory (e.g., level 1 (L1) cache, level 2 (L2) cache, level 3 (L3) cache, level 4 (L4) cache, level 5 (L5) cache or other (L#) cache), resistive random access memory (RRAM / ReRAM), phase change memory (PCM), spin-transfer torque RAM (STT-RAM), another memory chip or cartridge, and / or combinations thereof.
[0166] Storage device 1130 may include software services, servers, services, etc., which enable the system to perform functions when the code defining such software is executed by processor 1110. In some aspects, hardware services performing specific functions may include software components stored in a computer-readable medium connected to necessary hardware components, such as processor 1110, connection 1105, output device 1135, etc., to perform functions. The term "computer-readable medium" includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying instructions and / or data. Computer-readable media may include non-transitory media in which data can be stored and which does not include carrier waves and / or transient electronic signals propagated wirelessly or via a wired connection. Examples of non-transitory media may include, but are not limited to, magnetic disks or magnetic tapes, optical storage media (such as compact discs (CDs) or digital versatile discs (DVDs)), flash memory, memory, or memory devices. Computer-readable media may store code and / or machine-executable instructions thereon, which may represent procedures, functions, subroutines, programs, routines, subroutines, modules, software packages, classes, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or hardware circuitry by passing and / or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc., may be passed, forwarded, or transmitted via any suitable means, including memory sharing, message passing, token passing, network transmission, etc.
[0167] Specific details have been provided in the foregoing description to offer a thorough understanding of the aspects and examples presented herein, but those skilled in the art will recognize that this application is not limited thereto. Therefore, although illustrative aspects of this application have been described in detail herein, it is to be understood that the inventive concepts can be implemented and employed in a variety of other ways, and the appended claims are intended to be construed to include such variations, except as limited by the prior art. The various features and aspects of the application described above can be used individually or in combination. Furthermore, without departing from the broader scope of the specification, aspects can be utilized in any number of environments and applications beyond those described herein. Therefore, the specification and drawings should be considered illustrative rather than restrictive. For illustrative purposes, the methods are described in a particular order. It should be understood that, in alternative aspects, the methods may be performed in a different order than described.
[0168] For clarity, in some instances, this technology may be presented as comprising various functional blocks, which include devices, device components, steps, or routines embodied in a method, either in software or a combination of hardware and software. Additional components may be used in addition to those shown in the figures and / or described herein. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form to avoid obscuring these aspects in unnecessary detail. In other cases, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail to avoid obscuring the aspects.
[0169] Furthermore, those skilled in the art will understand that the various exemplary logic blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein can be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability between hardware and software, various exemplary components, blocks, modules, circuits, and steps have been described above in general terms of their functionality. Whether such functionality is implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in different ways for each specific application, but such implementation decisions should not be construed as departing from the scope of this disclosure.
[0170] The various aspects described above can be presented as a process or method, depicted as a flowchart, flow diagram, data flow diagram, structural diagram, or block diagram. Although a flowchart can describe operations as a sequential process, many of the operations can be executed in parallel or concurrently. Furthermore, the order of operations can be rearranged. A process terminates when its operations are completed, but a process may have additional steps not included in the diagrams. A process can correspond to a method, function, procedure, subroutine, subprogram, etc. When a process corresponds to a function, the termination of the process can correspond to the function returning to the calling function or the main function.
[0171] The processes and methods described in the examples above can be implemented using computer-executable instructions that are stored on, or otherwise available from, a computer-readable medium. Such instructions may include, for example, instructions and data that cause or otherwise configure a general-purpose computer, special-purpose computer, or processing device to perform a function or group of functions. Portions of the computer resources used may be accessible over a network. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that can be used to store the instructions, the information used, and / or information created during the methods according to the described examples include magnetic or optical disks, flash memory, USB devices with non-volatile memory, networked storage devices, etc.
[0172] In some respects, computer-readable storage devices, media, and memories may include cables or wireless signals containing bit streams, etc. However, when referred to, non-transitory computer-readable storage media explicitly exclude media such as energy, carrier signals, electromagnetic waves, and the signals themselves.
[0173] Those skilled in the art will understand that information and signals can be represented using any of a variety of different techniques and arts. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referred to throughout the above description may, in some cases, be represented by voltage, current, electromagnetic waves, magnetic fields or magnetic particles, light fields or light particles, or any combination thereof, depending in part on the specific application, in part on the desired design, in part on the corresponding technology, etc.
[0174] The various exemplary logic blocks, modules, and circuits described in conjunction with the aspects disclosed herein can be implemented or executed using hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and can take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, program code or code segments (e.g., computer program products) for performing the necessary tasks can be stored in a computer-readable or machine-readable medium. A processor can perform the necessary tasks. Examples of form factors include laptop computers, smartphones, mobile phones, tablet devices, or other small form factor personal computers, personal digital assistants, rack-mounted devices, self-contained devices, etc. The functionality described herein can also be embodied in peripheral devices or add-in cards. By way of further example, such functionality can also be implemented on a circuit board among different chips, or in different processes executing on a single device.
[0175] Instructions, media for delivering such instructions, computing resources for executing them, and other structures for supporting such computing resources are example components for providing the functionality described in this disclosure.
[0176] The techniques described herein can also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques can be implemented in any of a variety of devices, such as general-purpose computers, wireless communication devices (mobile phones), or integrated circuit devices with multiple uses, including applications in wireless communication devices (mobile phones) and other devices. Any feature described as a module or component can be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques can be implemented at least in part by a computer-readable data storage medium comprising program code including instructions that, when executed, perform one or more of the methods, algorithms, and / or operations described above. The computer-readable data storage medium can form part of a computer program product, which may include packaging materials. The computer-readable medium may include memory or data storage media, such as random access memory (RAM) (such as synchronous dynamic random access memory (SDRAM)), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic or optical data storage media, etc. Additionally or alternatively, the technology may be implemented at least in part by a computer-readable communication medium that carries or conveys program code in the form of instructions or data structures that can be accessed, read and / or executed by a computer, such as propagated signals or waves.
[0177] The program code can be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Such processors can be configured to perform any of the techniques described in this disclosure. A general-purpose processor may be a microprocessor; however, in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors combined with a DSP core, or any other such configuration. Therefore, as used herein, the term "processor" may refer to any of the foregoing structures, any combination of the foregoing structures, or any other structure or means suitable for implementing the techniques described herein.
[0178] Those skilled in the art will appreciate that the less than ("<") and greater than (">") symbols or terms used herein can be replaced with less than or equal to ("≤") and greater than or equal to ("≥") symbols, respectively, without departing from the scope of this description.
[0179] When a component is described as being “configured” to perform certain operations, such configuration can be achieved, for example, by designing electronic circuits or other hardware to perform the operations, by programming programmable electronic circuits (e.g., microprocessors or other suitable electronic circuits) to perform the operations, or any combination thereof.
[0180] The phrase “coupled to” or “communicatively coupled to” means that any component is physically connected directly or indirectly to another component, and / or that any component is in communication with another component directly or indirectly (e.g., connected to that other component via a wired or wireless connection and / or other suitable communication interface).
[0181] Claim language or other language that recites "at least one of" and / or "one or more of" a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim. For example, claim language reciting "at least one of A and B" or "at least one of A or B" means A, B, or A and B. In another example, claim language reciting "at least one of A, B, and C" or "at least one of A, B, or C" means A, B, C, or A and B, or A and C, or B and C, or A and B and C, or any duplicate information or data (e.g., A and A, B and B, C and C, A and A and B, etc.), or any other ordering, repetition, or combination of A, B, and C. The language "at least one of" and / or "one or more of" a set does not limit the set to the items listed in the set. For example, claim language reciting "at least one of A and B" or "at least one of A or B" may mean A, B, or A and B, and may additionally include items not listed in the set of A and B. The phrases "at least one" and "one or more" are used interchangeably herein.
[0182] Claim language or other language that recites "at least one processor, the at least one processor is configured to," "at least one processor is configured to," "one or more processors, the one or more processors are configured to," "one or more processors are configured to," etc., indicates that one or more processors (in any combination) can perform the associated operations. For example, claim language reciting "at least one processor, the at least one processor is configured to: X, Y, and Z" means that a single processor can be used to perform operations X, Y, and Z; or that multiple processors are each assigned a specific subset of operations X, Y, and Z, such that the multiple processors together perform X, Y, and Z; or that a group of multiple processors work together to perform operations X, Y, and Z. In another example, claim language reciting "at least one processor, the at least one processor is configured to: X, Y, and Z" can mean that any single processor may perform only at least a subset of operations X, Y, and Z.
[0183] When referring to one or more elements that perform functions (e.g., steps of a method), one element may perform all functions, or more than one element may jointly perform these functions. When more than one element jointly performs these functions, each function does not need to be performed by every single element (e.g., different functions may be performed by different elements), and / or each function does not need to be performed by only one element as a whole (e.g., different elements may perform different sub-functions of a function). Similarly, when referring to one or more elements configured to cause another element (e.g., a device) to perform functions, one element may be configured to cause another element to perform all functions, or more than one element may be jointly configured to cause another element to perform these functions.
[0184] When referring to an entity that performs or is configured to perform functions (e.g., steps of a method) (e.g., any entity or device described herein), the entity may be configured to cause one or more elements (individually or collectively) to perform those functions. One or more components of the entity may include at least one memory, at least one processor, at least one communication interface, another component configured to perform one or more of those functions, and / or any combination thereof. When referring to an entity that performs functions, the entity may be configured to cause one component to perform all functions, or to cause more than one component to perform those functions collectively. When the entity is configured to cause more than one component to perform those functions collectively, each function does not need to be performed by every single component (e.g., different functions may be performed by different components), and / or each function does not need to be performed by only one component as a whole (e.g., different components may perform different sub-functions of a function).
[0185] The exemplary aspects of this disclosure include:
[0186] Aspect 1. An apparatus for processing one or more audio samples, the apparatus comprising: one or more memories configured to store the one or more audio samples; and one or more processors coupled to the one or more memories, the one or more processors configured to: detect spoken keywords within audio samples in the one or more audio samples using a first keyword detection model; determine an estimated keyword index corresponding to the detection of the spoken keywords within the audio samples, the estimated keyword index including an estimated keyword start index and an estimated keyword end index; determine speech rate information corresponding to the audio samples using a speech rate classification machine learning network; obtain an average spoken length value corresponding to the spoken keywords and the speech rate information; and generate a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword start index offset to a time earlier than the estimated keyword start index.
[0187] Aspect 2. The apparatus according to aspect 1, wherein the speech rate information indicates a slow speech rate classification, a normal speech rate classification, or a fast speech rate classification for the spoken keywords within the audio sample.
[0188] Aspect 3. The apparatus according to any one of Aspects 1 to 2, wherein the one or more processors are configured to determine the speech rate information and the estimated keyword start index in parallel.
[0189] Aspect 4. The apparatus according to any one of Aspects 1 to 3, wherein the one or more processors are configured to: determine the speech rate information using the speech rate classification machine learning network in response to detecting the spoken keyword; and determine the estimated keyword start index using a keyword start estimation neural network in response to detecting the spoken keyword.
[0190] Aspect 5. The apparatus according to aspect 4, wherein: the first keyword detection model and the keyword start estimation neural network are included in a first keyword detection level of a multi-level keyword detection system.
[0191] Aspect 6. The apparatus according to any one of Aspects 1 to 5, wherein: the first keyword detection model is configured to perform always-on keyword detection on the one or more audio samples; and the speech rate classification machine learning network is configured to perform speech rate classification on a particular audio sample of the one or more audio samples based on the first keyword detection model detecting the spoken keyword within the particular audio sample.
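As a rough illustration of Aspects 3 to 6, the following Python sketch gates the speech rate classifier and the keyword start estimator behind an always-on first-stage detector, and runs the two concurrently once a keyword is detected. All three model functions are hypothetical placeholders (simple amplitude thresholds), not the networks described in this disclosure; the threshold, the class boundaries, and every function name are assumptions made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Sequence, Tuple

THRESHOLD = 0.5  # hypothetical amplitude threshold standing in for model scores


def detect_keyword(samples: Sequence[float]) -> bool:
    """Placeholder for the always-on first keyword detection model."""
    return any(abs(s) > THRESHOLD for s in samples)


def classify_speech_rate(samples: Sequence[float]) -> str:
    """Placeholder for the speech rate classification network."""
    active = sum(1 for s in samples if abs(s) > THRESHOLD)
    if active < 3:
        return "fast"
    if active < 6:
        return "normal"
    return "slow"


def estimate_keyword_start(samples: Sequence[float]) -> int:
    """Placeholder for the keyword start estimation network."""
    # Only called after a detection, so at least one sample exceeds THRESHOLD.
    return next(i for i, s in enumerate(samples) if abs(s) > THRESHOLD)


def first_stage(samples: Sequence[float]) -> Optional[Tuple[str, int]]:
    """Run the heavier models only after a detection, and run them concurrently."""
    if not detect_keyword(samples):
        return None  # nothing detected: the downstream models stay idle
    with ThreadPoolExecutor(max_workers=2) as pool:
        rate_future = pool.submit(classify_speech_rate, samples)
        start_future = pool.submit(estimate_keyword_start, samples)
        return rate_future.result(), start_future.result()


if __name__ == "__main__":
    print(first_stage([0.0, 0.0, 0.9, 0.8, 0.0]))
    print(first_stage([0.0] * 5))
```

The early return is the point of the gating: when the first-stage detector reports nothing, neither downstream model is invoked, which is where the power saving of a multi-level arrangement would come from.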
[0192] Aspect 7. The apparatus according to any one of Aspects 1 to 6, wherein: the average spoken length value is included in average keyword length information corresponding to the spoken keywords; and the average keyword length information includes a corresponding average spoken length value for each of a plurality of speech rate classifications associated with the speech rate classification machine learning network.
[0193] Aspect 8. The apparatus according to aspect 7, wherein the average keyword length information includes an offline estimate of the corresponding average spoken length value.
[0194] Aspect 9. The apparatus according to any one of Aspects 7 to 8, wherein each corresponding average spoken length value in the average keyword length information is embedded in machine learning model metadata associated with the configuration of the speech rate classification machine learning network or the configuration of the keyword index refinement machine learning network for generating the refined keyword index.
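One way to picture Aspects 7 to 9 is a small lookup table, estimated offline and shipped alongside the model configuration, that maps each speech rate classification to an average spoken length for the keyword. The sketch below is illustrative only: the metadata layout, the key names, the keyword string, and the frame counts are invented for the example and do not come from this disclosure.

```python
from enum import Enum
from typing import Any, Dict


class SpeechRate(Enum):
    SLOW = "slow"
    NORMAL = "normal"
    FAST = "fast"


# Hypothetical model metadata. The values stand in for offline estimates of
# the average spoken keyword length (in frames) per speech rate class.
MODEL_METADATA: Dict[str, Any] = {
    "keyword": "example_keyword",
    "avg_spoken_length_frames": {"slow": 120, "normal": 90, "fast": 65},
}


def average_spoken_length(metadata: Dict[str, Any], rate: SpeechRate) -> int:
    """Look up the average spoken length for the classified speech rate."""
    return metadata["avg_spoken_length_frames"][rate.value]


if __name__ == "__main__":
    for rate in SpeechRate:
        print(rate.value, average_spoken_length(MODEL_METADATA, rate))
```

Keeping the table in metadata rather than in code means the averages could be re-estimated offline without changing the runtime logic.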
[0195] Aspect 10. The apparatus according to any one of Aspects 1 to 9, wherein, in order to generate the refined keyword index, the one or more processors are configured to: determine an estimated length of the spoken keyword based on the difference between the estimated keyword end index and the estimated keyword start index; compare the estimated length with the average spoken length value to determine a refined length of the spoken keyword; and generate the refined keyword index based on the refined length of the spoken keyword.
[0196] Aspect 11. The apparatus according to aspect 10, wherein, in order to generate the refined keyword index, the one or more processors are configured to: determine the refined keyword start index as a time index that is shifted forward in time by a first amount compared to the estimated keyword start index, the first amount corresponding to the difference between the refined length and the estimated length of the spoken keyword; and determine a refined keyword end index as a time index that is shifted backward in time by a second amount compared to the estimated keyword end index, the second amount corresponding to the difference between the refined length and the estimated length of the spoken keyword.
[0197] Aspect 12. The apparatus according to aspect 11, wherein the first amount and the second amount are the same.
[0198] Aspect 13. The apparatus according to any one of Aspects 11 to 12, wherein: the first amount includes a first percentage of the difference between the refined length and the estimated length of the spoken keyword; and the second amount includes a second percentage of the difference between the refined length and the estimated length of the spoken keyword.
[0199] Aspect 14. The apparatus according to aspect 13, wherein the first percentage is greater than the second percentage.
[0200] Aspect 15. The apparatus according to any one of Aspects 13 to 14, wherein the first percentage is greater than 50%, and wherein the sum of the first percentage and the second percentage is equal to 100%.
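The arithmetic of Aspects 10 to 15 can be sketched as follows. The aspects above do not state how the comparison yields the refined length, so this sketch assumes the refined length is the larger of the estimated length and the average spoken length. It also assumes that "shifted forward" means earlier in time (consistent with Aspect 1) and that "shifted backward" means later in time. The names `KeywordIndex`, `refine_keyword_index`, and `start_percent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordIndex:
    start: int  # time index (e.g., frame number) where the keyword begins
    end: int    # time index where the keyword ends

    @property
    def length(self) -> int:
        return self.end - self.start


def refine_keyword_index(
    estimated: KeywordIndex, avg_spoken_length: int, start_percent: int = 70
) -> KeywordIndex:
    """Widen the estimated index toward the average spoken length.

    start_percent is the share of the length difference applied to the start
    index; the remainder goes to the end index, so the two shares sum to 100%.
    """
    if not 0 <= start_percent <= 100:
        raise ValueError("start_percent must be between 0 and 100")

    # ASSUMPTION: the refined length is never shorter than the estimate.
    refined_length = max(estimated.length, avg_spoken_length)
    difference = refined_length - estimated.length

    first_amount = difference * start_percent // 100  # applied to the start
    second_amount = difference - first_amount         # applied to the end

    # ASSUMPTION: the start moves earlier and the end moves later; the start
    # is clamped at 0 so it never precedes the beginning of the buffer.
    return KeywordIndex(
        start=max(0, estimated.start - first_amount),
        end=estimated.end + second_amount,
    )


if __name__ == "__main__":
    print(refine_keyword_index(KeywordIndex(100, 160), avg_spoken_length=90))
```

With an estimated index of (100, 160) and an average spoken length of 90, the difference of 30 is split 21 to 9, giving a refined index of (79, 169). Setting `start_percent=50` corresponds to the equal split of Aspect 12, while any value above 50 matches Aspects 14 and 15.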
[0201] Aspect 16. The apparatus according to any one of aspects 1 to 15, the apparatus further comprising a microphone configured to acquire the one or more audio samples.
[0202] Aspect 17. The apparatus according to any one of aspects 1 to 16, the apparatus further comprising: one or more microphones configured to capture the one or more audio samples for keyword detection.
[0203] Aspect 18. The apparatus according to aspect 17, wherein the one or more microphones and the first keyword detection model are associated with an always-on keyword detection process implemented by the apparatus.
[0204] Aspect 19. A processor-implemented method for processing one or more audio samples, the method comprising: detecting spoken keywords within audio samples in the one or more audio samples using a first keyword detection model; determining an estimated keyword index corresponding to the detection of the spoken keywords within the audio samples, the estimated keyword index including an estimated keyword start index and an estimated keyword end index; determining speech rate information corresponding to the audio samples using a speech rate classification machine learning network; obtaining an average spoken length value corresponding to the spoken keywords and the speech rate information; and generating a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword start index offset to a time earlier than the estimated keyword start index.
[0205] Aspect 20. The processor-implemented method according to aspect 19, wherein the speech rate information indicates a slow speech rate classification, a normal speech rate classification, or a fast speech rate classification for the spoken keywords within the audio sample.
[0206] Aspect 21. The processor-implemented method according to any one of Aspects 19 to 20, the processor-implemented method further comprising determining the speech rate information and the estimated keyword start index in parallel.
[0207] Aspect 22. The processor-implemented method according to any one of Aspects 19 to 21, the processor-implemented method further comprising: determining the speech rate information using the speech rate classification machine learning network in response to detecting the spoken keyword; and determining the estimated keyword start index using a keyword start estimation neural network in response to detecting the spoken keyword.
[0208] Aspect 23. The processor-implemented method according to aspect 22, wherein: the first keyword detection model and the keyword start estimation neural network are included in a first keyword detection level of a multi-level keyword detection system.
[0209] Aspect 24. The processor-implemented method according to any one of Aspects 19 to 23, wherein: the first keyword detection model is configured to perform always-on keyword detection on the one or more audio samples; and the speech rate classification machine learning network is configured to perform speech rate classification on a particular audio sample of the one or more audio samples based on the first keyword detection model detecting the spoken keyword within the particular audio sample.
[0210] Aspect 25. The processor-implemented method according to any one of Aspects 19 to 24, wherein: the average spoken length value is included in average keyword length information corresponding to the spoken keywords; and the average keyword length information includes a corresponding average spoken length value for each of a plurality of speech rate classifications associated with the speech rate classification machine learning network.
[0211] Aspect 26. The processor-implemented method according to aspect 25, wherein the average keyword length information includes an offline estimate of the corresponding average spoken length value.
[0212] Aspect 27. The processor-implemented method according to any one of Aspects 25 to 26, wherein each corresponding average spoken length value in the average keyword length information is embedded in machine learning model metadata associated with the configuration of the speech rate classification machine learning network or the configuration of the keyword index refinement machine learning network for generating the refined keyword index.
[0213] Aspect 28. The processor-implemented method according to any one of Aspects 19 to 27, wherein generating the refined keyword index comprises: determining an estimated length of the spoken keyword based on the difference between the estimated keyword end index and the estimated keyword start index; comparing the estimated length with the average spoken length value to determine a refined length of the spoken keyword; and generating the refined keyword index based on the refined length of the spoken keyword.
[0214] Aspect 29. The processor-implemented method according to aspect 28, wherein generating the refined keyword index comprises: determining the refined keyword start index as a time index that is shifted forward in time by a first amount compared to the estimated keyword start index, the first amount corresponding to the difference between the refined length and the estimated length of the spoken keyword; and determining a refined keyword end index as a time index that is shifted backward in time by a second amount compared to the estimated keyword end index, the second amount corresponding to the difference between the refined length and the estimated length of the spoken keyword.
[0215] Aspect 30. The processor-implemented method according to aspect 29, wherein the first amount and the second amount are the same.
[0216] Aspect 31. The processor-implemented method according to any one of Aspects 29 to 30, wherein: the first amount includes a first percentage of the difference between the refined length and the estimated length of the spoken keyword; and the second amount includes a second percentage of the difference between the refined length and the estimated length of the spoken keyword.
[0217] Aspect 32. The processor-implemented method according to aspect 31, wherein the first percentage is greater than the second percentage.
[0218] Aspect 33. The processor-implemented method according to any one of Aspects 31 to 32, wherein the first percentage is greater than 50%, and wherein the sum of the first percentage and the second percentage is equal to 100%.
[0219] Aspect 34. A non-transitory computer-readable storage medium comprising instructions stored thereon, the instructions, when executed by at least one processor, causing the at least one processor to perform operations according to any one of Aspects 1 to 18.
[0220] Aspect 35. A non-transitory computer-readable storage medium comprising instructions stored thereon, the instructions, when executed by at least one processor, causing the at least one processor to perform operations according to any one of Aspects 19 to 33.
[0221] Aspect 36. An apparatus comprising one or more components for performing operations according to any one of Aspects 1 to 18.
[0222] Aspect 37. An apparatus comprising one or more components for performing operations according to any one of Aspects 19 to 33.
Claims
1. An apparatus for processing one or more audio samples, the apparatus comprising: one or more memories configured to store the one or more audio samples; and one or more processors coupled to the one or more memories, the one or more processors being configured to: detect spoken keywords within audio samples in the one or more audio samples using a first keyword detection model; determine an estimated keyword index corresponding to the detection of the spoken keywords within the audio samples, the estimated keyword index including an estimated keyword start index and an estimated keyword end index; determine speech rate information corresponding to the audio samples using a speech rate classification machine learning network; obtain an average spoken length value corresponding to the spoken keywords and the speech rate information; and generate a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword start index that is offset to a time earlier than the estimated keyword start index.
2. The apparatus of claim 1, wherein the speech rate information indicates a slow speech rate classification, a normal speech rate classification, or a fast speech rate classification for the spoken keywords within the audio sample.
3. The apparatus of claim 1, wherein the one or more processors are configured to determine the speech rate information and the estimated keyword start index in parallel.
4. The apparatus of claim 1, wherein the one or more processors are configured to: determine the speech rate information using the speech rate classification machine learning network in response to detecting the spoken keyword; and determine the estimated keyword start index using a keyword start estimation neural network in response to detecting the spoken keyword.
5. The apparatus according to claim 4, wherein: the first keyword detection model and the keyword start estimation neural network are included in a first keyword detection level of a multi-level keyword detection system.
6. The apparatus according to claim 1, wherein: the first keyword detection model is configured to perform always-on keyword detection on the one or more audio samples; and the speech rate classification machine learning network is configured to perform speech rate classification on a particular audio sample of the one or more audio samples based on the first keyword detection model detecting the spoken keyword within the particular audio sample.
7. The apparatus according to claim 1, wherein: the average spoken length value is included in average keyword length information corresponding to the spoken keywords; and the average keyword length information includes a corresponding average spoken length value for each of a plurality of speech rate classifications associated with the speech rate classification machine learning network.
8. The apparatus of claim 7, wherein the average keyword length information includes an offline estimate of the corresponding average spoken length value.
9. The apparatus of claim 7, wherein each corresponding average spoken length value in the average keyword length information is embedded in machine learning model metadata associated with the configuration of the speech rate classification machine learning network or the configuration of the keyword index refinement machine learning network for generating the refined keyword index.
10. The apparatus of claim 1, wherein, in order to generate the refined keyword index, the one or more processors are configured to: determine an estimated length of the spoken keyword based on the difference between the estimated keyword end index and the estimated keyword start index; compare the estimated length with the average spoken length value to determine a refined length of the spoken keyword; and generate the refined keyword index based on the refined length of the spoken keyword.
11. The apparatus of claim 10, wherein, in order to generate the refined keyword index, the one or more processors are configured to: determine the refined keyword start index as a time index that is shifted forward in time by a first amount compared to the estimated keyword start index, the first amount corresponding to the difference between the refined length and the estimated length of the spoken keyword; and determine a refined keyword end index as a time index that is shifted backward in time by a second amount compared to the estimated keyword end index, the second amount corresponding to the difference between the refined length and the estimated length of the spoken keyword.
12. The apparatus of claim 11, wherein the first amount and the second amount are the same.
13. The apparatus according to claim 11, wherein: the first amount includes a first percentage of the difference between the refined length and the estimated length of the spoken keyword; and the second amount includes a second percentage of the difference between the refined length and the estimated length of the spoken keyword.
14. The apparatus of claim 13, wherein the first percentage is greater than the second percentage.
15. The apparatus of claim 13, wherein the first percentage is greater than 50%, and wherein the sum of the first percentage and the second percentage is equal to 100%.
16. The apparatus of claim 1, further comprising a microphone configured to acquire the one or more audio samples.
17. The apparatus according to claim 1, further comprising: one or more microphones configured to capture the one or more audio samples for keyword detection.
18. The apparatus of claim 17, wherein the one or more microphones and the first keyword detection model are associated with an always-on keyword detection process implemented by the apparatus.
19. A processor-implemented method for processing one or more audio samples, the method comprising: detecting spoken keywords within audio samples in the one or more audio samples using a first keyword detection model; determining an estimated keyword index corresponding to the detection of the spoken keywords within the audio samples, the estimated keyword index including an estimated keyword start index and an estimated keyword end index; determining speech rate information corresponding to the audio samples using a speech rate classification machine learning network; obtaining an average spoken length value corresponding to the spoken keywords and the speech rate information; and generating a refined keyword index based on the estimated keyword index and the average spoken length value, wherein the refined keyword index includes a refined keyword start index that is offset to a time earlier than the estimated keyword start index.
20. The processor-implemented method of claim 19, wherein the speech rate information indicates a slow speech rate classification, a normal speech rate classification, or a fast speech rate classification for the spoken keywords within the audio sample.
21. The processor-implemented method of claim 19, further comprising determining the speech rate information and the estimated keyword start index in parallel.
22. The processor-implemented method according to claim 19, further comprising: determining the speech rate information using the speech rate classification machine learning network in response to detecting the spoken keyword; and determining the estimated keyword start index using a keyword start estimation neural network in response to detecting the spoken keyword.
23. The processor-implemented method according to claim 22, wherein: the first keyword detection model and the keyword start estimation neural network are included in a first keyword detection level of a multi-level keyword detection system.
24. The processor-implemented method according to claim 19, wherein: the first keyword detection model is configured to perform always-on keyword detection on the one or more audio samples; and the speech rate classification machine learning network is configured to perform speech rate classification on a particular audio sample of the one or more audio samples based on the first keyword detection model detecting the spoken keyword within the particular audio sample.
25. The processor-implemented method according to claim 19, wherein: the average spoken length value is included in average keyword length information corresponding to the spoken keywords; and the average keyword length information includes a corresponding average spoken length value for each of a plurality of speech rate classifications associated with the speech rate classification machine learning network.
26. The processor-implemented method of claim 25, wherein the average keyword length information includes an offline estimate of the corresponding average spoken length value.
27. The processor-implemented method of claim 25, wherein each corresponding average spoken length value in the average keyword length information is embedded in machine learning model metadata associated with the configuration of the speech rate classification machine learning network or the configuration of the keyword index refinement machine learning network used to generate the refined keyword index.
28. The processor-implemented method of claim 19, wherein generating the refined keyword index comprises: determining an estimated length of the spoken keyword based on the difference between the estimated keyword end index and the estimated keyword start index; comparing the estimated length with the average spoken length value to determine a refined length of the spoken keyword; and generating the refined keyword index based on the refined length of the spoken keyword.
29. The processor-implemented method of claim 28, wherein generating the refined keyword index comprises: determining the refined keyword start index as a time index that is shifted forward in time by a first amount compared to the estimated keyword start index, the first amount corresponding to the difference between the refined length and the estimated length of the spoken keyword; and determining a refined keyword end index as a time index that is shifted backward in time by a second amount compared to the estimated keyword end index, the second amount corresponding to the difference between the refined length and the estimated length of the spoken keyword.
30. The processor-implemented method of claim 29, wherein the first amount and the second amount are the same, and wherein: the first amount includes a first percentage of the difference between the refined length and the estimated length of the spoken keyword; and the second amount includes a second percentage of the difference between the refined length and the estimated length of the spoken keyword.