Generative neural networks with effective audio token processing
The generative neural network integrates text and timing tokens to address alignment challenges, reducing computational complexity and latency in speech recognition by generating unified output sequences, improving accuracy and efficiency.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- GDM HOLDING LLC
- Filing Date
- 2025-12-11
- Publication Date
- 2026-06-18
AI Technical Summary
Existing generative neural networks face challenges in maintaining precise alignment between audio and text modalities, particularly in speech recognition, leading to synchronization errors and increased computational complexity due to separate transcription and alignment stages, which are inefficient for real-time applications.
A generative neural network that processes audio input to generate a unified output sequence integrating text and timing tokens, eliminating the need for separate acoustic models and post-processing alignment, using a high-base numeral system for compact timestamp representation and training with interleaved audio and text segments to improve alignment accuracy.
This approach reduces computational load and latency by generating transcription and alignment simultaneously, ensuring precise temporal alignment without post-processing, thus enhancing the accuracy and efficiency of speech recognition.
Description
[0001] Attorney Docket No. 45288-0595WO1
[0002] GENERATIVE NEURAL NETWORKS WITH EFFECTIVE AUDIO TOKEN PROCESSING
[0004] CROSS-REFERENCE TO RELATED APPLICATION
[0005] This application claims priority to U.S. Application No. 63/730,955, filed December 11, 2024. The disclosure of the foregoing application is hereby incorporated by reference in its entirety.
[0006] BACKGROUND
[0007] This specification relates to training neural networks to generate output sequences. For example, the output sequences can include text sequences, audio sequences, pixel sequences (that represent an image or a video frame), and so on.
[0008] Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., another hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
[0009] SUMMARY
[0010] This specification describes a system implemented as computer programs on one or more computers in one or more locations that uses a neural network to perform one or more tasks that require generating or processing audio. In some situations, the neural network can thus be referred to as a generative neural network.
[0011] In particular, the generative neural network generates output sequences of output tokens, wherein each output token is selected from a vocabulary of tokens that includes at least text tokens and audio tokens. For example, text tokens can be generated from text by applying a text tokenizer to the text. Similarly, audio tokens can be generated from audio by applying an audio tokenizer to audio.
[0012] This specification generally describes a variety of techniques for improving the performance of the generative neural network on tasks that require processing audio data.
[0013] For example, this specification generally describes how to incorporate timing information in speech recognition outputs.
[0014] As another example, this specification generally describes how to train the generative neural network to generate outputs that interleave audio and text tokens, improving performance on a variety of tasks.
[0015] As yet another example, this specification describes how to format inputs to a generative neural network during dialogue sessions that include received audio (and optionally video) and generated audio at the same time step.
[0016] Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.
[0017] Existing automatic speech recognition (ASR) systems, particularly end-to-end generative models, can generate accurate text transcriptions of audio speech in speech recognition outputs but often lack the ability to provide precise temporal alignment (timestamps) for the recognized words or groups of words of the speech. In some cases, obtaining word-level timestamps requires a secondary post-processing step, such as forced alignment using an external model or a hybrid system approach. This multi-stage processing increases computational complexity, introduces latency, and consumes additional memory resources, making it inefficient for applications requiring real-time synchronization of speech and text. Moreover, without the multi-stage processing, these generative models cannot accurately reply to queries that require determining a temporal alignment for certain words in the input audio signal.
[0018] This specification describes techniques that can address the aforementioned challenges. That is, the specification describes techniques that can utilize a generative neural network to process audio input and produce, as output, a single output sequence that integrates both text tokens and timing tokens. Instead of treating transcription and alignment as separate tasks, the system selects tokens from a unified vocabulary containing both linguistic units and temporal markers. The system then parses this sequence to identify contiguous timing tokens associated with specific text units, decodes them into timestamps, and outputs a speech recognition output that includes a transcription where the text is explicitly linked to the time it was spoken within the audio window.
[0019] By processing the network input using a generative neural network to generate an output sequence where each output token is selected from a vocabulary of tokens that includes a plurality of text tokens and a plurality of timing tokens, the described techniques eliminate the need for separate acoustic models or post-processing alignment stages. This creates a more efficient "single-pass" architecture that reduces computational load and latency by generating transcription and alignment data simultaneously.
[0020] By identifying, in the output sequence, a plurality of contiguous subsequences of timing tokens and identifying a corresponding respective text unit that is represented by text tokens that precede the contiguous subsequence, the described techniques ensure a direct, context-aware association between specific words and their occurrence in time. This prevents synchronization errors that often occur when alignment is calculated independently of the decoding process and allows a system implementing the generative neural network to respond to queries that require temporal alignment without performing a computationally-expensive post-processing phase.
[0021] Because each timing token can represent a respective symbol in an alphabet of symbols in a particular numeral system (e.g., Base-32 or Base-64), the described techniques achieve a compact representation of time that minimizes the overall length of the output sequence. Unlike schemes that might require a unique token for every possible timestamp or long sequences of decimal digits, a high-base numeral system allows precise timestamps to be encoded using very few tokens, thereby reducing the computational overhead and memory required for the generative neural network to generate the sequence.
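The compactness argument above can be illustrated with a minimal sketch. It assumes that a timestamp is an integer count of fixed time units (for example, 10 ms steps) and uses a hypothetical 64-symbol alphabet; the specification does not state the symbols, the time resolution, or the function names used here.

```python
import string

# Hypothetical 64-symbol alphabet; the specification does not name the symbols.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "-_"
BASE = len(ALPHABET)  # 64


def encode_timestamp(units: int, width: int = 0) -> str:
    """Encode a non-negative count of time units as base-64 symbols.

    `width` left-pads the result with the zero symbol so that every
    timestamp can use the same number of timing tokens (an assumption).
    """
    if units < 0:
        raise ValueError("timestamp must be non-negative")
    symbols = []
    while True:
        units, digit = divmod(units, BASE)
        symbols.append(ALPHABET[digit])
        if units == 0:
            break
    return "".join(reversed(symbols)).rjust(width, ALPHABET[0])


def decode_timestamp(symbols: str) -> int:
    """Decode base-64 symbols (most significant first) back to time units."""
    value = 0
    for symbol in symbols:
        value = value * BASE + ALPHABET.index(symbol)
    return value
```

Under these assumptions, a 30-second window at 10 ms resolution has 3000 possible positions. Two base-64 symbols cover 4096 values, so every position fits in two timing tokens, whereas a decimal encoding of the same position can need four digits.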
[0022] Existing techniques for using generative neural networks for processing speech and text often face difficulties in maintaining precise alignment between modalities over long sequences. For example, when training the generative neural network to generate audio from text (or vice versa), the generative neural network may struggle to determine exactly which portion of the input sequence corresponds to the current portion of the output sequence being generated. This lack of explicit alignment can lead to synchronization errors, such as skipping words during speech generation or hallucinating text during recognition. Furthermore, training these generative neural networks often requires complex, multi-stage pipelines or computationally expensive attention mechanisms to infer these relationships from global data, increasing the processing resources required for training.
[0023] This specification describes techniques that can address the aforementioned challenges. That is, the specification describes techniques that can train a generative neural network using training examples where text and audio are interleaved at the segment level to generate a ground truth output sequence. Specifically, the system partitions an audio sequence into segments, identifies the corresponding text for each segment, and constructs a ground truth output sequence where the text tokens for a specific segment immediately precede the audio tokens for that same segment. Alternatively, the system partitions a target text output, identifies the corresponding audio segments, and constructs a ground truth output sequence where the audio tokens for a specific segment immediately precede the text tokens for that segment.
[0024] By identifying a partitioning of the audio sequence into a plurality of audio segments and identifying a text transcription of the audio segment, the described techniques break down complex, long-form data into manageable, locally aligned units. This reduces the computational complexity required for the generative neural network to learn the relationship between the audio and the text.
[0025] By generating a ground truth output sequence that includes a respective set of text tokens where the respective set of text tokens precede the respective set of audio tokens, the described techniques create a strong conditioning context for the neural network. The described techniques can train the generative neural network to “see” the text immediately before it attempts to generate the corresponding audio, thereby improving the accuracy of the synthesized audio tokens.
[0026] By identifying a partitioning of the target text output into a plurality of text segments and identifying an audio segment that represents a verbalization of the text segment, the described techniques ensure that the training data for the generative neural network explicitly maps specific sounds to specific words or phrases. This prevents the “alignment drift” often seen in existing end-to-end speech recognition models.
[0027] By generating a ground truth output sequence that includes a set of audio tokens where the set of audio tokens precede the respective set of text tokens, the described techniques train the generative neural network to process a specific chunk of audio and immediately predict the text for that chunk, thereby improving the accuracy of the synthesized text tokens.
[0028] Existing techniques that use generative neural networks to carry out communication sessions with users typically rely on a turn-taking architecture where a dialogue system processes input from a user system and generates output as mutually exclusive events. In these systems, the dialogue system must often wait for the user system to complete a transmission before processing the audio tokens and generating a response. This creates significant latency and prevents natural conversational dynamics during the communication session, such as simultaneous speech or immediate barge-in handling. Additionally, because the generation process is often decoupled from the real-time input stream from the user system, the dialogue system cannot easily condition its ongoing speech output on the immediate acoustic context of the user system or the dialogue system's own recent output.
[0029] This specification describes techniques that can address the aforementioned challenges. That is, the specification describes techniques that can operate a communication session in continuous time steps where the dialogue system utilizes a generative neural network fed by a combined input. Specifically, at each time step, the dialogue system captures audio tokens from the user system and combines them with the output audio token generated by the dialogue system at the preceding time step. This combined input is processed by the generative neural network to produce the next increment of audio output, allowing the dialogue system to maintain a synchronized, full-duplex interaction.
[0030] By processing an input set of tokens that includes (i) the one or more audio tokens and (ii) an output audio token generated by the dialogue system at the preceding time step in the communication session, the described techniques create a feedback mechanism that allows the dialogue system to be simultaneously aware of external input (from the user system) and its own internal state (the token it just generated). This facilitates a fluid communication session where the dialogue system can naturally handle interruptions or speech overlap.
[0031] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below.
[0032] According to a first aspect there is provided a method performed by one or more computers that includes receiving a network input that comprises audio data representing speech spoken during a given time window; processing the network input using a generative neural network to generate an output sequence of output tokens that represents a transcription of the speech, wherein each output token is selected from a vocabulary of tokens that includes a plurality of text tokens and a plurality of timing tokens; processing the output sequence of output tokens to generate a speech recognition output, comprising: identifying, in the output sequence, a plurality of contiguous subsequences of timing tokens; identifying, for each contiguous subsequence of timing tokens, a corresponding respective text unit that is represented by text tokens that precede the contiguous subsequence of timing tokens in the output sequence; determining, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens; and providing, as the speech recognition output, (i) the text represented by the tokens in the subsequence and (ii) data that specifies that, for each contiguous subsequence of timing tokens, the corresponding text unit for the contiguous subsequence of timing tokens was spoken at the respective timestamp represented by the contiguous subsequence of timing tokens.
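The post-processing steps of the first aspect can be sketched as follows. The token format is an assumption made only for illustration: each token is a `("text", piece)` or `("time", digit)` pair, and timing digits are read most significant first in a base-64 numeral system. The function name and the handling of trailing text are likewise hypothetical.

```python
from typing import List, Tuple, Union

# Hypothetical token format: ("text", "hel") or ("time", 17).
Token = Tuple[str, Union[str, int]]


def parse_output_sequence(tokens: List[Token], base: int = 64) -> List[Tuple[str, int]]:
    """Pair each contiguous run of timing tokens with the text that precedes it."""
    results: List[Tuple[str, int]] = []
    text_pieces: List[str] = []
    digits: List[int] = []

    def flush() -> None:
        # Close the timing run in progress and attach it to the pending text unit.
        if digits:
            value = 0
            for digit in digits:
                value = value * base + digit
            results.append(("".join(text_pieces), value))
            text_pieces.clear()
            digits.clear()

    for kind, value in tokens:
        if kind == "time":
            digits.append(int(value))
        elif kind == "text":
            flush()  # a text token ends any timing run in progress
            text_pieces.append(str(value))
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    flush()
    # Text that is never followed by timing tokens is dropped in this sketch.
    return results
```

For example, the sequence `hel`, `lo`, time digits `0, 50`, ` world`, time digits `1, 10` yields the unit `hello` at 50 time units and ` world` at 74 time units (1 × 64 + 10).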
[0033] In some cases, the generative neural network is configured to include, within the output sequence, a respective contiguous subsequence of timing tokens after each subsequence of text tokens that represent a specific unit of text in a natural language.
[0034] In some cases, the specific unit of text is a word in the natural language.
[0035] In some cases, the generative neural network has been trained on training examples that each include (i) audio representing training speech, (ii) a ground truth transcription of the training speech, and (iii) respective timing data that specifies, for each of one or more units of text within the ground truth transcription, a respective timestamp at which the unit of text was spoken.
[0036] In some cases, training the generative neural network on the training examples comprises, for each training example: generating, from the ground truth transcription of training speech and the respective timing data, a ground truth output sequence that includes, for each of the one or more units of text within the ground truth transcription, text tokens representing the unit of text followed by one or more timing tokens representing the respective timestamp at which the unit of text was spoken; and training the generative neural network using the audio in the training example and the ground truth output sequence for the training example.
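A minimal sketch of constructing such a ground truth output sequence is shown below. It assumes that timestamps are integer time units, that each timestamp is written as a fixed number of base-64 digits, and that timing tokens are spelled `<tN>`; the tokenizer is passed in as a callable. None of these names or choices come from the specification.

```python
from typing import Callable, List, Sequence, Tuple


def timing_tokens(units: int, base: int = 64, width: int = 2) -> List[str]:
    """Fixed-width digits in the given base, most significant first, as "<tN>" tokens."""
    if not 0 <= units < base ** width:
        raise ValueError("timestamp does not fit in the requested width")
    digits = []
    for _ in range(width):
        units, digit = divmod(units, base)
        digits.append(digit)
    return [f"<t{digit}>" for digit in reversed(digits)]


def build_ground_truth(
    timed_units: Sequence[Tuple[str, int]],
    tokenize_text: Callable[[str], List[str]],
) -> List[str]:
    """Emit the text tokens of each unit followed by the timing tokens of its timestamp."""
    sequence: List[str] = []
    for text, units in timed_units:
        sequence.extend(tokenize_text(text))
        sequence.extend(timing_tokens(units))
    return sequence
```

With a character-level tokenizer (`list`) standing in for a real text tokenizer, the units `("hi", 70)` and `("yo", 130)` produce `h, i, <t1>, <t6>, y, o, <t2>, <t2>`.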
[0037] In some cases, each timing token represents a respective symbol in an alphabet of symbols in a particular numeral system.
[0038] In some cases, determining, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens comprises: mapping each timing token in the subsequence to the respective symbol represented by the timing token; and decoding the respective symbols according to the particular numeral system to generate a timestamp.
[0039] In some cases, the particular numeral system is base-64.
[0040] In some cases, the particular numeral system is base-32.
[0041] In some cases, determining, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens further comprises: determining whether the timestamp is outside of the given time window; and in response to determining that the timestamp is outside the time window, modifying the timestamp to indicate a last timestamp that is within the given time window.
[0042] In some cases, determining, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens further comprises: determining whether the timestamp is earlier than a timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence; and in response to determining that the timestamp is earlier than a timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence, modifying the timestamp to indicate a timestamp that is no earlier than the timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence.
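The two corrections described above (clamping to the time window and enforcing that timestamps do not go backwards) can be combined in one pass, as in the following sketch. It assumes integer timestamps measured from the start of the window and sets an out-of-order timestamp equal to its predecessor, which is one way to satisfy "no earlier than"; the function name is hypothetical.

```python
from typing import List, Sequence


def repair_timestamps(timestamps: Sequence[int], window_end: int) -> List[int]:
    """Clamp decoded timestamps to the window and make them non-decreasing."""
    repaired: List[int] = []
    previous = 0  # assumed start of the time window
    for timestamp in timestamps:
        # Outside the window: use the last timestamp that is within the window.
        timestamp = min(timestamp, window_end)
        # Earlier than the preceding timestamp: move it up to that timestamp.
        timestamp = max(timestamp, previous)
        repaired.append(timestamp)
        previous = timestamp
    return repaired
```

For a window ending at 100 time units, the decoded values `10, 5, 40, 120, 90` become `10, 10, 40, 100, 100`.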
[0043] In some cases, the generative neural network comprises an auto-regressive neural network that auto-regressively generates tokens from the vocabulary.
[0044] According to a second aspect there is provided the methods of the first aspect performed by a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method.
[0045] According to a third aspect there is provided the methods of the first aspect performed by one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method.
[0046] According to a fourth aspect there is provided a method of training a generative neural network performed by one or more computers that includes obtaining a training example that comprises (i) a training input and (ii) a target output that comprises an audio sequence; identifying a partitioning of the audio sequence into a plurality of audio segments; for each audio segment, identifying a text transcription of the audio segment; generating a ground truth output sequence that comprises, for each audio segment, (i) a respective set of audio tokens representing the audio segment and (ii) a respective set of text tokens representing the text transcription of the audio segment, wherein the respective set of text tokens precede the respective set of audio tokens in the ground truth output sequence; and training the generative neural network using the training input and the ground truth output sequence.
[0047] In some cases of the fourth aspect, training the generative neural network using the training input and the ground truth output sequence comprises: training the generative neural network on a next token prediction objective.
[0048] In some cases of the fourth aspect, each respective set of audio tokens is immediately preceded by a start of audio token in the ground truth output sequence.
[0049] In some cases of the fourth aspect, each respective set of audio tokens is immediately followed by an end of audio token in the ground truth output sequence.
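The segment-level interleaving, including the start of audio and end of audio tokens, can be sketched as below. The marker spellings `<soa>` and `<eoa>`, the function name, and the representation of tokens as strings are assumptions for illustration. The `text_first` flag selects between text tokens preceding audio tokens and the alternative ordering, described elsewhere in this specification, in which audio tokens precede text tokens.

```python
from typing import List, Sequence, Tuple

# Hypothetical spellings for the start of audio and end of audio tokens.
START_OF_AUDIO = "<soa>"
END_OF_AUDIO = "<eoa>"


def interleave_segments(
    segments: Sequence[Tuple[Sequence[str], Sequence[str]]],
    text_first: bool = True,
) -> List[str]:
    """Build a ground truth sequence from (text tokens, audio tokens) pairs.

    Each set of audio tokens is wrapped in start/end markers; `text_first`
    controls whether a segment's text tokens precede or follow its audio tokens.
    """
    sequence: List[str] = []
    for text_tokens, audio_tokens in segments:
        audio_block = [START_OF_AUDIO, *audio_tokens, END_OF_AUDIO]
        if text_first:
            sequence.extend(text_tokens)
            sequence.extend(audio_block)
        else:
            sequence.extend(audio_block)
            sequence.extend(text_tokens)
    return sequence
```

A sequence built this way can be used directly as the target of a next token prediction objective, since each segment's conditioning tokens sit immediately before the tokens they help predict.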
[0050] In some cases of the fourth aspect, one or more of the plurality of audio segments each correspond to speech of a respective word and wherein the text transcription of the audio segment is a transcription of the respective word.
[0051] In some cases of the fourth aspect, one or more of the plurality of audio segments each correspond to speech of a respective chunk of multiple words and wherein the text transcription of the audio segment is a transcription of the multiple words.
[0052] In some cases of the fourth aspect, the method further comprises: generating the respective sets of audio tokens representing the audio segments by applying an audio tokenizer to the audio sequence.
[0053] In some cases of the fourth aspect, the method further comprises: generating the respective set of text tokens by applying a text tokenizer to the transcription of the audio segment.
[0054] According to a fifth aspect there is provided a method of training a generative neural network performed by one or more computers that includes obtaining a training example that comprises (i) a training input and (ii) a target text output; identifying a partitioning of the target text output into a plurality of text segments; for each text segment, identifying an audio segment that represents a verbalization of the text segment; generating a ground truth output sequence that comprises, for each text segment, (i) a respective set of text tokens representing the text segment and (ii) a respective set of audio tokens representing the corresponding audio segment, wherein the respective set of audio tokens precede the respective set of text tokens in the ground truth output sequence; and training the generative neural network using the training input and the ground truth output sequence.
[0055] In some cases of the fifth aspect, training the generative neural network using the training input and the ground truth output sequence comprises: training the generative neural network on a next token prediction objective.
[0056] In some cases of the fifth aspect, each respective set of audio tokens is immediately preceded by a start of audio token in the ground truth output sequence.
[0057] In some cases of the fifth aspect, each respective set of audio tokens is immediately followed by an end of audio token in the ground truth output sequence.
[0058] In some cases of the fifth aspect, one or more of the plurality of audio segments each correspond to speech of a respective word.
[0059] In some cases of the fifth aspect, one or more of the plurality of audio segments each correspond to speech of a respective chunk of multiple words and wherein the text transcription of the audio segment is a transcription of the multiple words.
[0060] In some cases of the fifth aspect, the method further comprises: generating the respective sets of audio tokens representing the audio segments by applying an audio tokenizer to the audio sequence.
[0061] In some cases of the fifth aspect, the method further comprises: generating the respective set of text tokens by applying a text tokenizer to the transcription of the audio segment.
[0062] According to a sixth aspect there is provided the methods of the fourth aspect or fifth aspect performed by a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method.
[0063] According to a seventh aspect there is provided the methods of the fourth aspect or fifth aspect performed by one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method.
[0064] According to an eighth aspect there is provided a method performed by one or more computers that includes initializing a communication session between a dialogue system and a user system; and at each of a plurality of time steps during the communication session: obtaining one or more audio tokens representing audio received from the user system at the time step; processing an input set of tokens that comprises (i) the one or more audio tokens and (ii) an output audio token generated by the dialogue system at the preceding time step in the communication session to generate one or more input tokens to a generative neural network; and processing the one or more input tokens using the generative neural network to generate one or more output audio tokens for the time step.
[0065] In some cases of the eighth aspect, the generative neural network is an auto-regressive generative neural network that auto-regressively generates output tokens.
[0066] In some cases of the eighth aspect, the method further comprises: processing the one or more output audio tokens to generate audio for the time step; and providing the audio for the time step for playback at the user system.
[0067] In some cases of the eighth aspect, the input tokens to the neural network are vectors in an embedding space, and wherein processing the input set of tokens comprises: generating a respective embedding of each token in the input set of tokens; and processing an input comprising the respective embeddings of the tokens in the input set of tokens using an input-output encoder neural network to generate, as output, an input token that is in the embedding space.
[0068] In some cases of the eighth aspect, the input comprising the respective embeddings of the tokens in the input set of tokens further comprises the respective input tokens at one or more preceding time steps during the communication session.
[0069] In some cases of the eighth aspect, the input set of tokens further comprises audio tokens representing audio received from the user system at one or more preceding time steps during the communication session.
[0070] In some cases of the eighth aspect, the method further comprises: at each of a plurality of time steps during the communication session: obtaining one or more video tokens representing video received from the user system at the time step, wherein the input set of tokens comprises (i) the one or more audio tokens, (ii) the output audio token generated by the dialogue system at the preceding time step, and (iii) the one or more video tokens.
[0071] In some cases of the eighth aspect, processing the input set of tokens comprises generating a sequence of the input set of tokens that interleaves the output audio token and the one or more audio tokens.
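The per-time-step loop of the eighth aspect can be sketched as follows. This is a simplified illustration under several assumptions: tokens are integers, the generative neural network is replaced by a caller-supplied `step_fn` that returns a single output audio token per time step, the embedding and encoder stage is omitted, and the input set places the preceding output audio token before the user audio tokens (the specification does not fix the interleaving order). All names are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence


def build_input_set(user_audio_tokens: Sequence[int], previous_output_token: int) -> List[int]:
    """Combine the preceding output audio token with the user audio tokens for this step."""
    return [previous_output_token, *user_audio_tokens]


def run_session(
    user_stream: Iterable[Sequence[int]],
    step_fn: Callable[[List[int]], int],
    initial_output_token: int,
) -> List[int]:
    """Run one communication session, feeding each output token back at the next step."""
    previous = initial_output_token
    outputs: List[int] = []
    for user_audio_tokens in user_stream:
        input_set = build_input_set(user_audio_tokens, previous)
        output_token = step_fn(input_set)  # stand-in for the generative neural network
        outputs.append(output_token)
        previous = output_token  # becomes part of the next step's input set
    return outputs
```

Because the system's own preceding output is part of every input set, the stand-in network sees both the incoming user audio and what the dialogue system just produced at each step, which is the feedback mechanism that supports overlapping speech.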
[0072] According to a ninth aspect there is provided the methods of the eighth aspect performed by a system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method.
[0073] According to a tenth aspect there is provided the methods of the eighth aspect performed by one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method.
[0074] Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
[0075] BRIEF DESCRIPTION OF THE DRAWINGS
[0076] FIG. 1A shows an inference system configured to incorporate timing information in speech recognition outputs.
[0077] FIG. 1B shows a training system configured to train a generative neural network to generate outputs that interleave audio and text tokens, where text tokens precede audio tokens.
[0078] FIG. 1C shows a training system configured to train a generative neural network to generate outputs that interleave audio and text tokens, where audio tokens precede text tokens.
[0079] FIG. 1D shows an inference system configured to format inputs to the generative neural network during dialogue sessions that include received audio (and optionally video) and generated audio at the same time step.
[0080] FIG. 2A is a flow diagram of an example process for incorporating timing information in speech recognition outputs.
[0081] FIG. 2B is a flow diagram of an example process for training a generative neural network to generate output that interleaves audio and text tokens, where text tokens precede audio tokens.
[0082] FIG. 2C is a flow diagram of an example process for training a generative neural network to generate output that interleaves audio and text tokens, where audio tokens precede text tokens.
[0083] FIG. 2D is a flow diagram of an example process for formatting inputs to the generative neural network during dialogue sessions that include received audio (and optionally video) and generated audio at the same time step.
[0084] FIG. 3 is a flow diagram of an example process for training a generative neural network.
[0085] FIG. 4 shows examples of the performance of the described techniques.
[0086] Like reference numbers and designations in the various drawings indicate like elements.
[0087] DETAILED DESCRIPTION OF THE DRAWINGS
[0088] FIG. 1A shows an example inference system 100. The system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[0089] The inference system 100 can use a neural network 102 to perform one or more tasks that require generating or processing audio. In some situations, the neural network 102 can thus be referred to as a generative neural network 102. In particular, the generative neural network 102 generates output sequences of output tokens, where each output token is selected from a vocabulary of tokens that includes at least text tokens and audio tokens.
[0090] Description of examples of the generative neural network 102 now follows.
[0091] The generative neural network 102 is a neural network having parameters and that can be configured through training to process an input sequence that is made up of tokens from a vocabulary in accordance with the parameters to generate, based on the input sequence, an output sequence for a generative task that is made up of tokens from the vocabulary. For example, the input sequence can include a prompt that provides context for the output sequence.
[0092] After training, the inference system 100 or another system can deploy the generative neural network 102 on one or more computing devices to perform inference for the one or more generative tasks, i.e., to generate new output sequences for the generative tasks based on new input sequences.
[0093] The vocabulary of tokens can include any of a variety of tokens that represent text symbols or other symbols. For example, the vocabulary of tokens can include one or more of characters, sub-words, words, punctuation marks, numbers, or other symbols that appear in a corpus of natural language text and/or computer code.
[0094] Additionally, or alternatively, the vocabulary of tokens can include tokens that can represent data other than text. For example, the vocabulary of tokens can include image tokens that represent a discrete set of image patch embeddings of an image that can be generated by an image encoder neural network based on processing the image patches of the image. As another example, the vocabulary of tokens can include audio tokens that represent code vectors in a codebook of a quantizer, e.g., a residual vector quantizer.
[0095] In some implementations, the generative neural network 102 can be configured as an auto-regressive language model neural network. The language model neural network is referred to as an auto-regressive neural network when the language model neural network auto-regressively generates an output sequence of tokens by generating each particular token in the output sequence conditioned on a current input sequence that includes any (e.g., all) tokens that precede the particular token in the output sequence, i.e., tokens that have already been generated for any previous positions in the output sequence that precede the particular position of the particular token, and the input sequence.
[0096] For example, the current input sequence when generating a token at any given position in the output sequence can include the input sequence and the tokens at any preceding positions that precede the given position in the output sequence. As a particular example, the current input sequence can include the input sequence followed by the tokens at any (e.g., all) preceding positions that precede the given position in the output sequence. Optionally, the input sequence and the current output sequence can be separated by one or more predetermined tokens within the current input sequence.
[0097] More specifically, to generate a particular token at a particular position within an output sequence, the generative neural network 102 can process the current input sequence to generate a score distribution, e.g., a probability distribution, that assigns a respective score, e.g., a respective probability, to each token in the vocabulary of tokens. The generative neural network 102 can then select, as the particular token, a token from the vocabulary using the score distribution. For example, the generative neural network 102 can greedily select the highest-scoring token or can sample, e.g., using nucleus sampling or another sampling technique, a token from the distribution.
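The two selection strategies named above can be sketched as follows. This is a minimal illustration only, not the implementation of the generative neural network 102: the function names, the `top_p` threshold, and the use of plain Python lists for scores are all assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_select(probs):
    """Return the index of the highest-scoring token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample from the smallest set of top tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # rng.choices renormalizes the weights of the kept tokens.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With a small `top_p`, the nucleus can shrink to a single token, in which case nucleus sampling coincides with greedy selection.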
[0098] As a particular example, the generative neural network 102 can be or comprise an auto-regressive Transformer-based neural network that includes (i) a sequence comprising a plurality of attention blocks that each apply a self-attention operation and (ii) an output subnetwork that processes an output of the last attention block to generate the score distribution.
[0099] The generative neural network 102 can have any of a variety of Transformer-based language model neural network architectures. Examples of such neural network architectures include those described in Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019; Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977, 2020; Aakanksha Chowdhery, et al. PaLM: Scaling Language Modeling with Pathways, arXiv preprint arXiv:2204.02311; Rohan Anil, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023; Comanici, Gheorghe, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein et al. "Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities." arXiv preprint arXiv:2507.06261 (2025); Team, Gemma, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin et al. "Gemma 3 technical report." arXiv preprint arXiv:2503.19786 (2025); and Gemini Team, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
[0100] Generally, however, the Transformer-based language model neural network includes a sequence of attention blocks, and, during the processing of a given input sequence, each attention block in the sequence receives a respective input hidden state for each input token in the given input sequence. The attention block then updates at least the hidden state for the last token in a given input sequence at least in part by applying self-attention to generate a respective output hidden state for the last token. The input hidden states for the first attention block are embeddings of the input tokens in the input sequence and the input hidden states for each subsequent attention block are the output hidden states generated by the preceding attention block.
[0101] In this example, the output subnetwork processes the output hidden state generated by the last attention block in the sequence for the last input token in the input sequence to generate the score distribution.
[0102] As an example, the generative neural network 102 can generate text sequences, i.e., each output sequence generated by the generative neural network 102 is a sequence of text tokens from a vocabulary of text tokens that includes, e.g., one or more of characters, subwords, words, punctuation marks, numbers, or other symbols that appear in natural language text. For example, the inference system 100 can use the generative neural network 102 to generate text sequences and provide the text sequences for presentation to users.
[0103] As another example, the generative neural network 102 can generate images or videos that have multiple frames (where each frame is an image) by generating images, e.g., either as sequences of pixels or through an iterative denoising process. For example, the output sequence generated by the generative neural network 102 includes a plurality of color values for pixels in an image arranged according to a specified order. As another example, the output sequence generated by the generative neural network 102 includes a plurality of tokens that represent image patch embeddings of an image which can then be processed by a decoder neural network to generate the image. For example, the inference system 100 can use the generative neural network 102 to generate an image or a video conditioned on an input sequence that includes a text description of the content of the image or the video.
[0104] As another example, the input sequence is a sequence of text and the output sequence is another sequence of text, e.g., a completion of the input sequence of text, a paraphrase of the input sequence of text, a response to a question posed in the input sequence, or a sequence of text that is about a topic specified by the input sequence of text. As another example, the input sequence can be an input other than text, e.g., a plurality of pixels included in an image, and the output sequence can be a text sequence that describes the input.
[0105] As another example, the input sequence represents data to be compressed, e.g., image data, text data, audio data, or any other type of data; and the output sequence is a compressed version of the data. The tokens included in the output sequence can include any representation of compressed data, e.g., symbols or embeddings to be decoded by a respective neural network.
[0106] As a particular example, the inference system 100 can be part of a dialog system and the input sequence can include audio or text from the most recent conversational turn submitted by a user of the dialog system during the dialog while the output sequence is the next turn in the conversation, e.g., either text or audio that is a response to the most recent conversational turn. Optionally, the input sequence can also include one or more historical conversational turns that occurred earlier in the conversation.
[0107] As another particular example, the inference system 100 can be part of a machine translation system and the input sequence can include text in a source language while the output sequence can include text in a target language that is a translation of the source text into the target language.
[0108] As another particular example, the inference system 100 can be part of a natural language processing system. For example, if the input sequence is a sequence of words in an original language, e.g., a sentence or phrase, the output sequence can be a summary of the input sequence in the original language, i.e., a sequence that has fewer words than the input sequence but that retains the essential meaning of the input sequence. As another example, if the input sequence is a sequence of words that form a question, the output sequence can be a sequence of words that form an answer to the question.
[0109] As another particular example, the inference system 100 can be part of a computer-assisted medical diagnosis system. For example, the input sequence can be a sequence of data from an electronic medical record and the output sequences can each be a sequence of predicted treatments.
[0110] As another particular example, the inference system 100 can be part of a computer code generation system and the input sequence can include a text description of a desired piece of code or a snippet of computer code in a programming language and the output sequence can include computer code, e.g., a snippet of code that is described by the input sequence or a snippet of code that follows the input sequence in a computer program.
[0111] As another particular example, the inference system 100 can be part of a multi-modal system that processes multi-modal input sequences, e.g., both text and image input sequences, or both text and audio input sequences, and generates the output sequences that are either in a single data modality or in multiple data modalities, e.g., text and image output sequences, or text and audio output sequences. Examples of such multi-modal systems include an image captioning system, a text-based image search system, an image-based question answering system, and so on.
[0112] As another particular example, the inference system 100 can be part of or associated with a robotic control system, i.e., a system for controlling one or more mechanical agents. The input sequence can comprise a natural language description of one or more tasks for the one or more mechanical agents and the output sequence can comprise a sequence of instructions (e.g., joint angles, torques, velocities, etc.) for the one or more mechanical agents that cause the one or more mechanical agents to perform the one or more tasks described in the input sequence.
[0113] In a similar example, the inference system 100 can be part of or associated with a control system in a manufacturing environment for manufacturing a product, i.e., a system for controlling a manufacturing unit or a machine that operates to manufacture the product. In another similar example, the inference system 100 can be part of or associated with a control system in a service facility comprising a plurality of items of electronic equipment.
[0114] As another particular example, the inference system 100 can be part of or associated with a search system that facilitates searching of resources on the Internet. A resource can be any data that can be provided over the Internet. A resource can be identified by a resource address that is associated with the resource. Resources include web pages, word processing documents, portable document format (PDF) documents, images, video, and news feed sources, to name a few.
[0115] In this particular example, the search system can receive search queries submitted by client devices and, in response, identify resources that are relevant to the search query in the form of search results and return the search results to the user devices in search results pages. A search result page can include search result data generated by the search system that identifies a resource responsive to a search query, and includes a link to the resource. The search result page can additionally include a result in the form of an output sequence that is generated by the inference system 100 based on an input sequence derived from the search query.
[0116] The generative neural network 102 is typically trained using a multi-stage approach: a pre-training stage followed by a fine-tuning stage. These stages can be performed by the inference system 100, another system (e.g., a training system), or both. As an example, the system 100 can receive data specifying a pre-trained generative neural network 102 from another system (e.g., a training system), and then perform the fine-tuning of the pre-trained generative neural network.
[0117] In the pre-training stage, the generative neural network 102 is pre-trained by the inference system 100 or another system based on optimizing one or more unsupervised or self-supervised objective functions, e.g., a maximum-likelihood objective function, on one or more large datasets and then, in some cases, adjusted to the generative tasks, which can include any combination of one or more of the generative tasks mentioned below and possibly other tasks, through fine-tuning adaptation based on supervised learning, reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), prompt tuning, instruction tuning, and the like, that use different training objectives, different datasets, or both.
[0118] The one or more large datasets used during the pre-training stage can include a large dataset of text in one or more natural languages, e.g., text that is publicly available from the Internet or another text corpus, a large dataset of computer code in one or more programming languages, e.g., Python, C++, C#, Java, Ruby, PHP, and so on, e.g., computer code that is publicly available from the Internet or another code repository, a large dataset of audio samples, e.g., audio recordings or waveforms that represent the audio recordings, a large dataset of images where each image includes an array of pixels, a large dataset of videos where each video includes a temporal sequence of frames, or a large multi-modal dataset that includes a combination of two or more of these datasets.
[0119] The inference system 100 of FIG. 1A is configured to incorporate timing information in speech recognition outputs.
[0120] For example, the system 100 can perform automatic speech recognition by processing audio data to generate an output sequence of output tokens that includes both text tokens representing spoken words and timing tokens representing the specific times those words were spoken. Then, by decoding these timing tokens, the system 100 can provide a text transcription coupled with precise timestamps for each word or otherwise respond to queries that require identifying timestamps for certain text in the text transcript.
[0121] More specifically, the system 100 receives a network input 104 that includes audio data representing speech spoken during a given time window.
[0122] The audio data can be a digital representation of an audio signal containing spoken language. For example, the audio data can include a sequence of audio frames (e.g., log-mel filterbank energies) corresponding to a specific duration of audio, such as a 0.3, 3, or 30-second audio segment.
[0123] The system 100 processes the network input 104 using a generative neural network 102 to generate an output sequence 106 of output tokens that represents a transcription of the speech. Each output token is selected from a vocabulary of tokens that includes a plurality of text tokens and a plurality of timing tokens.
[0124] As described above, text tokens can be sub-word units, characters, or whole words that form the semantic content of the speech. For example, text tokens can include standard alphanumeric characters, punctuation marks, and partial word stems common in natural languages.
[0125] Timing tokens can be a designated subset of the vocabulary reserved specifically for representing numerical timing values. For example, the timing tokens can be tokens that represent numerals but do not appear in text transcripts. For example, a timing token can be represented as “<ctrl1>”, which does not appear in text transcripts.
[0126] In some cases, the generative neural network 102 includes an auto-regressive neural network that auto-regressively generates tokens from the vocabulary. In such cases, for example, the auto-regressive neural network can process the audio data to generate audio embeddings and sequentially predict a next token in the output sequence based on the audio embeddings and previously generated tokens in the output sequence.
[0127] In some cases, the generative neural network 102 is configured to include, within the output sequence 106, a respective contiguous subsequence of timing tokens after each subsequence of text tokens that represent a specific unit of text in a natural language. The specific unit of text can be a word in the natural language.
[0128] As an example, for the output sequence “Hello<ctrl36> world!<ctrl1><ctrl12>”, “Hello” and “world!” are each units of text; “<ctrl36>”, “<ctrl1>” and “<ctrl12>” are timing tokens; and “<ctrl36>” and “<ctrl1><ctrl12>” are each a contiguous subsequence of timing tokens after a subsequence of text tokens.
[0129] For example, each timing token can represent a respective symbol in an alphabet of symbols in a particular numeral system. The particular numeral system can have any base, such as base-64 or base-32. For example, the system 100 can utilize a base-64 counting scheme where specific timing tokens map to values 0 through 63, allowing a short sequence of tokens to represent large numbers.
[0130] For example, the timing token “<ctrl36>“ can map to the integer value 36 in a base 64 numeral system.
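To illustrate why a high base keeps timestamps compact, the sketch below encodes an integer count of time units as base-64 timing tokens. It is a hypothetical helper, not part of the described system. It assumes the “<ctrlN>” token spelling used in the examples above and assumes the most significant digit comes first, which this description does not specify. Under the further assumption of 40-millisecond units, a 30-second window spans 750 units, which fits in two tokens because 64 × 64 = 4096.

```python
BASE = 64  # base-64 numeral system, per the example in this description

def encode_timing_tokens(value, base=BASE):
    """Encode a non-negative integer as a list of timing tokens.

    Assumption: digits are emitted most significant first.
    """
    if value < 0:
        raise ValueError("timing values must be non-negative")
    digits = []
    while True:
        value, digit = divmod(value, base)
        digits.append(digit)
        if value == 0:
            break
    return [f"<ctrl{d}>" for d in reversed(digits)]
```

For instance, `encode_timing_tokens(36)` yields a single token, while `encode_timing_tokens(76)` yields two tokens with digits 1 and 12.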
[0131] In some implementations, the generative neural network 102 has been trained on training examples that each include (i) audio representing training speech, (ii) a ground truth transcription of the training speech, and (iii) respective timing data that specifies for each of one or more units of text within the ground truth transcription, a respective timestamp at which the unit of text was spoken. In other words, the system 100 or another system trains the generative neural network 102 using training examples where transcribed spoken text and the exact time each word ends are known.
[0132] Further details of training the generative neural network 102 on these training examples are described below.
[0133] The system 100 then processes the output sequence 106 of output tokens to generate a speech recognition output 108.
[0134] Generally, the speech recognition output 108 includes the transcribed text 110 along with data 112 indicating the specific time at which that text 110 was spoken.
[0135] For example, if the output sequence 106 contains the string ‘Hello<ctrl36>’, the speech recognition output 108 provided by the system 100 includes the text ‘Hello’ associated with a specific time value calculated from the timing token ‘<ctrl36>’, indicating that the word ‘Hello’ was spoken at that specific timestamp in the audio data representing the speech.
[0136] To process the output sequence 106 of output tokens to generate a speech recognition output 108, the system 100 identifies, in the output sequence 106, a plurality of contiguous subsequences of timing tokens.
[0137] For example, in the output sequence 106 “Hello<ctrl36> world!<ctrl1><ctrl12>”, the system 100 can identify “<ctrl36>” and “<ctrl1><ctrl12>” as each being a contiguous subsequence of timing tokens.
[0138] The system 100 then identifies, for each contiguous subsequence of timing tokens, a corresponding respective text unit that is represented by text tokens that precede the contiguous subsequence of timing tokens in the output sequence 106.
[0139] For example, for the contiguous subsequence “<ctrl36>”, the system 100 identifies the text token “Hello” as the corresponding respective text unit that is represented by text tokens that precede the contiguous subsequence of timing tokens.
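The identification steps above can be sketched on a detokenized string. This is an illustrative assumption: the description operates on token sequences, whereas the hypothetical `pair_text_with_timing` helper below works on text in which timing tokens are spelled “<ctrlN>”.

```python
import re

# Assumption: timing tokens appear as "<ctrlN>" in the detokenized output.
TIMING_RUN = re.compile(r"((?:<ctrl\d+>)+)")
TIMING_TOKEN = re.compile(r"<ctrl\d+>")

def pair_text_with_timing(decoded_output):
    """Pair each text unit with the run of timing tokens that follows it."""
    # re.split with a capturing group keeps the timing runs, so the result
    # alternates: text, timing run, text, timing run, ..., trailing text.
    parts = TIMING_RUN.split(decoded_output)
    pairs = []
    for text, run in zip(parts[0::2], parts[1::2]):
        pairs.append((text.strip(), TIMING_TOKEN.findall(run)))
    return pairs
```

Applied to “Hello<ctrl36> world!<ctrl1><ctrl12>”, the helper pairs “Hello” with one timing token and “world!” with two.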
[0140] Afterwards, the system 100 determines, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens.
[0141] For example, the system 100 calculates that the token “<ctrl36>” represents a specific time value within the audio segment.
[0142] In some cases, to determine, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens, the system 100 can map each timing token in the subsequence to the respective symbol represented by the timing token. Then the system 100 can decode the respective symbols according to the particular numeral system to generate a timestamp.
[0143] For example, if the token “<ctrl36>” corresponds to the integer 36, and each unit represents 40 milliseconds, the system 100 decodes the timestamp as 36 multiplied by 40 ms, resulting in 1440 ms.
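The decoding arithmetic can be sketched as follows. The function name, the “<ctrlN>” token spelling, and the most-significant-digit-first ordering are assumptions made for illustration; the 40-millisecond unit is taken from the example above and is only one possible resolution.

```python
import re

TIMING_TOKEN = re.compile(r"<ctrl(\d+)>")

def decode_timestamp_ms(timing_tokens, base=64, unit_ms=40):
    """Decode a contiguous run of timing tokens into milliseconds.

    Assumption: the first token is the most significant digit.
    """
    value = 0
    for token in timing_tokens:
        match = TIMING_TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"not a timing token: {token!r}")
        digit = int(match.group(1))
        if digit >= base:
            raise ValueError(f"digit {digit} is out of range for base {base}")
        value = value * base + digit
    return value * unit_ms
```

A single token with value 36 decodes to 36 × 40 ms = 1440 ms. Under the stated digit-order assumption, the two-token run with digits 1 and 12 decodes to (1 × 64 + 12) × 40 ms = 3040 ms.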
[0144] In some cases, the system 100 determines whether the timestamp is outside of the given time window and, in response to determining that the timestamp is outside the time window, modifies the timestamp to indicate a last timestamp that is within the given time window.
[0145] For example, if the generated timestamp corresponds to 31 seconds but the audio input was only 30 seconds long, the system 100 caps the timestamp at 30 seconds to ensure it remains valid.
[0146] In some other cases, the system 100 determines whether the timestamp is earlier than a timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence. Then, in response to determining that the timestamp is earlier than a timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence 106, the system 100 modifies the timestamp to indicate a timestamp that is no earlier than the timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence 106.
[0147] For example, if the timestamp for a current word is generated as 1.2 seconds, but the timestamp for the immediately preceding word was 1.3 seconds, the system 100 corrects the current timestamp to 1.3 seconds to prevent chronological errors.
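Both corrections can be sketched in one pass. The description presents them as separate cases, so applying them together, and comparing each timestamp against the already corrected preceding value, are assumptions of this hypothetical `sanitize_timestamps` helper.

```python
def sanitize_timestamps(timestamps_ms, window_ms):
    """Clamp timestamps to the time window and keep them non-decreasing."""
    fixed = []
    previous = 0  # timestamps are non-negative offsets into the window
    for t in timestamps_ms:
        t = min(t, window_ms)  # cap at the last timestamp in the window
        t = max(t, previous)   # never earlier than the preceding timestamp
        fixed.append(t)
        previous = t
    return fixed
```

For a 30-second window, the inputs 1300 ms, 1200 ms, and 31000 ms become 1300 ms, 1300 ms, and 30000 ms, matching the two examples above.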
[0148] The system 100 then provides, as the speech recognition output 108, (i) the text 110 represented by the tokens in the subsequence and (ii) data 112 that specifies that, for each contiguous subsequence of timing tokens, the corresponding text unit for the contiguous subsequence of timing tokens was spoken at the respective timestamp represented by the contiguous subsequence of timing tokens.
[0149] In some cases, prior to the inference system 100 performing the operations described above, a training system can train the generative neural network 102 used by the inference system 100. For example, training system 101 or training system 103 (described below) can train the generative neural network 102 prior to the inference system 100 performing the operations described above.
[0150] FIG. 1B shows an example training system 101. The system 101 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[0151] The training system 101 of FIG. 1B is configured to train a generative neural network to generate outputs that interleave audio and text tokens, where text tokens precede audio tokens.
[0152] In particular, the system 101 obtains a training example 114 that includes (i) a training input 116 and (ii) a target output 118 that includes an audio sequence.
[0153] The training input 116 can be data serving as a prompt or context for the generative neural network 102, such as a sequence of input tokens of any modality that precedes the target output 118. For example, because the generative neural network 102 can be a multimodal model, the training input 116 can include text tokens, audio tokens, video tokens, or any combination thereof. For example, the training input 116 can be a textual, audio, or visual context, such as a user query in a dialogue (e.g., “How are you?”) or a specific instruction to the generative neural network 102, which conditions the generative neural network 102 to generate a subsequent response.
[0154] The audio sequence of the target output 118 can be a waveform or a stream of audio data. For example, the audio sequence can be an audio waveform, or a sequence of discrete audio frames of the waveform.
[0155] In some cases, the training example 114 includes alignment information. Alignment information can include data that specifies the temporal correspondence between units of text in a transcription of the audio sequence and the audio sequence itself. For example, the alignment information can include word-level timings that indicate the start time and end time of each word or phrase within the audio sequence, identifying which specific time interval of the audio corresponds to a specific unit of text.
[0156] As an example, the system 101 can obtain a training example 114 from a database of annotated media, a repository of video content, or logged interactions from a dialogue system. For example, the system can retrieve a video segment with an associated audio and text transcript, or access a dataset containing paired audio and text data.
[0157] The system 101 identifies a partitioning of the audio sequence into a plurality of audio segments 120.
[0158] For example, the system 101 can identify a partitioning of the audio sequence into audio segments 120 using alignment information included in the training example 114. For example, the system 101 can utilize timestamps provided in the alignment information to determine the start time and end time for dividing the audio sequence into distinct segments 120, where each segment corresponds to a specific annotated unit of text or a defined duration of speech.
[0159] In some cases, one or more of the plurality of audio segments 120 each correspond to speech of a respective word. For instance, the audio sequence is segmented such that each segment 120 captures the vocalization of a single word in a natural language.
[0160] In some cases, one or more of the plurality of audio segments 120 each corresponds to the speech of a respective chunk of multiple words. For example, the system 101 may partition the audio sequence into fixed-duration audio segments 120 (e.g., 0.3, 3 or 30-second segments) or semantic phrases that encompass multiple spoken words.
[0161] The system 101, for each audio segment 120, identifies a text transcription of the audio segment 120.
[0162] For example, the system 101 can identify the text transcription by mapping the temporal boundaries of the audio segment 120 to the corresponding text transcript included in training example 114 using alignment information also included in the training example 114. For example, the system 101 can extract the specific word or sequence of words from the transcript that are annotated as having been spoken during the time interval defined by the start and end times of the respective audio segment 120.
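As a rough sketch of this lookup, the hypothetical helper below represents alignment information as (word, start, end) tuples in seconds and keeps the words that fall entirely within a segment. Both the tuple format and the containment rule are assumptions; this description does not say how words that straddle a segment boundary are handled.

```python
def transcript_for_segment(word_timings, seg_start, seg_end):
    """Return the words spoken entirely within [seg_start, seg_end]."""
    words = [
        word
        for word, start, end in word_timings
        if start >= seg_start and end <= seg_end
    ]
    return " ".join(words)

# Hypothetical word-level alignment for the utterance "How are you".
timings = [("How", 0.0, 0.2), ("are", 0.25, 0.4), ("you", 0.45, 0.7)]
first_segment_text = transcript_for_segment(timings, 0.0, 0.4)
```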
[0163] The system 101 generates a ground truth output sequence 122 that includes, for each audio segment 120, (i) a respective set of audio tokens representing the audio segment and (ii) a respective set of text tokens representing the text transcription of the audio segment 120. The respective set of text tokens precede the respective set of audio tokens in the ground truth output sequence 122.
[0164] In some cases, the system 101 configures the ground truth output sequence 122 such that each respective set of audio tokens is immediately preceded by a start of audio token. For example, the output sequence can include the tokens “<start_of_audio> <audio_token_1> <audio_token_2> <audio_token_3>” where “<start_of_audio>” is the start of audio token.
[0165] Furthermore, in some cases, each respective set of audio tokens is immediately followed by an end of audio token in the ground truth output sequence 122. For example, the output sequence can include the tokens “<start_of_audio> <audio_token_1> <audio_token_2> <audio_token_3> <end_of_audio>” where “<end_of_audio>” is the end of audio token.
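The layout of the ground truth output sequence 122 described above can be sketched as follows. The function name and the representation of each segment as a (text tokens, audio tokens) pair are illustrative assumptions; the delimiter spellings follow the examples in this description.

```python
def build_ground_truth(segments,
                       start_token="<start_of_audio>",
                       end_token="<end_of_audio>"):
    """Interleave per-segment tokens, with text tokens before audio tokens.

    `segments` is a list of (text_tokens, audio_tokens) pairs, one pair per
    audio segment, in temporal order.
    """
    sequence = []
    for text_tokens, audio_tokens in segments:
        sequence.extend(text_tokens)   # transcription of the segment first
        sequence.append(start_token)   # then the delimited audio tokens
        sequence.extend(audio_tokens)
        sequence.append(end_token)
    return sequence

# Hypothetical tokens for a two-segment example.
example = build_ground_truth([
    (["Hello"], ["<audio_token_1>", "<audio_token_2>"]),
    (["world"], ["<audio_token_3>"]),
])
```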
[0167] In some implementations, the system 101 can generate the respective sets of audio tokens representing the audio segments 120 by applying an audio tokenizer to the audio sequence. For example, the audio tokenizer can process the waveform of the audio segment 120 in a streaming manner to convert it into a sequence of discrete hard tokens that represent the audio data.
[0168] As used herein, ‘hard tokens’ (or ‘discrete hard tokens’) refer to discrete, quantized representations of data selected from a finite vocabulary, such as integer indices from a learned codebook (e.g., a Vector Quantized Variational Autoencoder (VQ-VAE) codebook). This is in contrast to ‘soft tokens’ or continuous representations (e.g., dense vectors or embeddings) which can include continuous floating-point values. For example, an audio tokenizer can process a continuous audio waveform to generate a sequence of hard tokens, where each token is a specific integer (e.g., ranging from 0 to 256,000) that corresponds to a nearest-neighbor vector in the model’s fixed codebook.
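The nearest-neighbor lookup that turns continuous vectors into hard tokens can be sketched in a few lines. The toy two-dimensional codebook and the squared Euclidean distance are assumptions for illustration; a real audio tokenizer would first encode the waveform into frame vectors with a learned encoder, which is omitted here.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook vector."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]

# Toy codebook with three code vectors; real codebooks are far larger.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
hard_tokens = quantize([[0.1, -0.2], [3.5, 4.2], [0.9, 1.2]], codebook)
```

Each resulting hard token is an integer index into the codebook, whereas the input frames are the continuous ("soft") representation.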
[0169] In some implementations, the system 101 can generate the respective set of text tokens by applying a text tokenizer to the transcription of the audio segment 120.
[0170] The system 101 trains the generative neural network 102 using the training input 116 and the ground truth output sequence 122.
[0171] In some cases, the system 101 can train the generative neural network 102 on a next token prediction objective.
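A next token prediction objective is commonly implemented as the average cross-entropy between the predicted score distribution at each position and the token that actually follows. The sketch below shows that computation in plain Python. It assumes the scores are unnormalized logits, and it leaves open whether the loss is taken over the whole sequence or only over the ground truth output sequence 122, which this description does not specify.

```python
import math

def next_token_loss(logits_per_position, target_ids):
    """Average negative log-likelihood of each target token.

    logits_per_position[i] holds the scores over the vocabulary predicted
    at position i; target_ids[i] is the index of the correct next token.
    """
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        m = max(logits)
        # Numerically stable log of the softmax normalizer.
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(target_ids)
```

With uniform scores over a four-token vocabulary, the loss equals the natural logarithm of 4, and it approaches zero as the score of the correct token dominates.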
[0172] Further details of training the generative neural network are described below with reference to FIG. 3.
[0173] FIG. 1C shows an example training system 103. The system 103 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[0174] The training system 103 of FIG. 1C is configured to train a generative neural network to generate outputs that interleave audio and text tokens, where audio tokens precede text tokens.
[0175] The system 103 obtains a training example 124 that includes (i) a training input 126 and (ii) a target text output 128.
[0176] The training input 126 can be data serving as a prompt or context for the generative neural network 102, such as a sequence of input tokens of any modality that precedes the target text output 128. For example, because the generative neural network 102 can be a multi-modal model, the training input 126 can include text tokens, audio tokens, video tokens, or any combination thereof. For example, the training input 126 can be a textual, audio, or visual context, such as a user query in a dialogue (e.g., “How are you?”) or a specific instruction to the generative neural network 102, which conditions the generative neural network 102 to generate a subsequent response.
[0177] The target text output 128 is a textual transcription. For example, the target text output 128 is a sequence of words representing a transcript of an audio component included in the training example 124 (e.g., as associated audio data).
[0178] In some cases, the training example 124 includes alignment information. Alignment information can include data that specifies the temporal correspondence between units of text in the textual transcription and an audio sequence corresponding to the textual transcription. For example, the alignment information can include word-level timings that indicate the start time and end time of each word or phrase within the audio sequence, identifying which specific time interval of the audio corresponds to a specific unit of text.
[0179] As described above, the system 103 can obtain a training example 124 from any of a variety of appropriate sources. For example, the system 103 can obtain the training example 124 from a database of annotated media, a repository of video content, or logged interactions from a dialogue system.
[0180] The system 103 identifies a partitioning of the target text output 128 into a plurality of text segments 130.
[0181] For example, the system 103 can identify a partitioning by utilizing alignment information that maps specific time intervals of the audio to corresponding units of text. Specifically, the system 103 uses the alignment information to determine the boundaries for dividing the target text output 128 into text segments 130 corresponding to specific granularities, such as individual words or phrases.
[0182] The system 103, for each text segment 130, identifies an audio segment that represents a verbalization of the text segment 130.
[0183] In some cases, one or more of the plurality of audio segments can correspond to speech of a respective word. For instance, the system 103 identifies a slice of the audio waveform that corresponds specifically to the pronunciation of the single word identified in a respective text segment 130.
[0184] In some cases, one or more of the plurality of audio segments can correspond to speech of a respective chunk of multiple words and the text transcription of the audio segment can be a transcription of the multiple words.
[0185] For example, the system 103 identifies an audio segment corresponding to a 30-second duration or a complete semantic phrase, and the text segment includes the sequence of words spoken during that interval.
[0186] The system 103 generates a ground truth output sequence 132 that includes, for each text segment 130, (i) a respective set of text tokens representing the text segment 130 and (ii) a respective set of audio tokens representing the corresponding audio segment. The respective set of audio tokens precedes the respective set of text tokens in the ground truth output sequence 132.
[0187] In some cases, each respective set of audio tokens is immediately preceded by a start of audio token in the ground truth output sequence 132. Furthermore, in some cases, each respective set of audio tokens is immediately followed by an end of audio token in the ground truth output sequence 132.
[0188] In some implementations, the system 103 can generate the respective sets of audio tokens representing the audio segments by applying an audio tokenizer to the audio sequence. For example, the audio tokenizer processes the raw waveform of the identified audio segment to convert it into a stream of discrete hard tokens.
[0189] In some implementations, the system 103 can generate the respective set of text tokens by applying a text tokenizer to the transcription of the audio segment (i.e., the text segment 130). For example, the system 103 converts the word or phrase of the text segment 130 into a sequence of text tokens using a tokenizer vocabulary.
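The assembly of an interleaved ground truth output sequence can be sketched as follows. This is an illustrative example only: the marker strings SOA and EOA, the function name, and the toy token values are assumptions, and the audio_first flag is included merely so that the same sketch can show either ordering of audio and text within a segment.

```python
from typing import List, Sequence, Tuple, Union

Token = Union[int, str]

# Hypothetical marker tokens; the actual vocabulary entries are not specified.
SOA = "<start_of_audio>"
EOA = "<end_of_audio>"


def build_interleaved_sequence(
    segments: Sequence[Tuple[Sequence[int], Sequence[str]]],
    audio_first: bool = True,
) -> List[Token]:
    """Interleave per-segment audio tokens and text tokens.

    Each segment is (audio_tokens, text_tokens). Audio tokens are wrapped
    in start-of-audio / end-of-audio markers.
    """
    sequence: List[Token] = []
    for audio_tokens, text_tokens in segments:
        audio_part: List[Token] = [SOA, *audio_tokens, EOA]
        text_part: List[Token] = list(text_tokens)
        if audio_first:
            sequence.extend(audio_part + text_part)
        else:
            sequence.extend(text_part + audio_part)
    return sequence


# Toy segments: audio token ids are made up; text is split into sub-words.
segments = [([17, 4], ["My"]), ([9, 9, 3], ["daugh", "ter"])]
audio_first_sequence = build_interleaved_sequence(segments, audio_first=True)
print(audio_first_sequence)
```

With audio_first=True, every set of audio tokens (with its markers) comes before the text tokens of the same segment, matching the ordering described for the ground truth output sequence 132.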
[0190] The system 103 trains the generative neural network 102 using the training input 126 and the ground truth output sequence 132. In some cases, the system 103 can train the generative neural network 102 on a next token prediction objective.
[0191] FIG. 1D shows an example inference system 105. The system 105 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[0192] The inference system 105 of FIG. 1D is configured to format inputs to the generative neural networks during dialogue sessions that include received audio (and optionally video) and generated audio at the same time step.
[0193] For example, the system 105 can facilitate a full-duplex communication session where it simultaneously processes incoming user audio and generates outgoing system audio. By feeding its own previous output back into the input stream alongside the user’s current input, the system 105 maintains a continuous, synchronized conversational flow without requiring explicit turn-taking.
[0194] In particular, the system 105 initializes a communication session between a dialogue system 134 and a user system 136.
[0195] The dialogue system 134 can be, for example, a server-based or on-device computer system implementing the generative model, which is capable of processing and generating audio streams. For example, the dialogue system 134 can be a cloud server running the generative neural network trained to handle audio, text, and video inputs simultaneously.
[0196] The user system 136 can be, for example, a client device equipped with audio capture (microphone) and playback (speaker) capabilities. For example, the user system 136 can be a mobile phone, a smart home assistant, or a personal computer that captures the user’s speech and environment. In some cases, the system 105 runs on the user system 136. More specifically, in some cases, the system 105 and the dialogue system 134 are both implemented as one or more computer programs on the user system 136.
[0197] The communication session can therefore be, for example, a synchronous exchange of data streams where the system and a user can speak and listen at the same time (full-duplex). The “time steps” in the communication session can refer to discrete intervals of time at which the continuous data streams are sampled and processed.
[0198] The system 105, at each of a plurality of time steps during the communication session, performs the following operations.
[0199] The system 105 obtains one or more audio tokens 140 representing audio received from the user system 136 at the time step.
[0200] For example, the system 105 can obtain these audio tokens 140 by receiving the audio waveform captured by the user system 136 and processing it through an audio tokenizer. The tokenizer can convert the continuous waveform into a sequence of hard tokens that represent the audio for that specific time step.
[0201] For example, the system 105 receives a 40ms chunk of audio from the user’s microphone, applies echo cancellation to remove the system’s own audio, and converts the remaining user speech into a specific integer value.
[0202] In some implementations, at each of the plurality of time steps during the communication sessions, the system 105 can obtain one or more video tokens representing video received from the user system 136 at the time step.
[0203] The system 105 can obtain these tokens by receiving video frames captured by the user system 136 and processing them through a video tokenizer. Similar to audio, the video tokenizer converts the visual data into a set of discrete tokens.
[0204] For example, the system 105 can take a current video frame from a user’s camera and convert it into a sequence of hard tokens representing the visual information for that time step.
[0205] The system 105 processes an input set of tokens that includes (i) the one or more audio tokens 140 and (ii) an output audio token 138 generated by the dialogue system 134 at the preceding time step in the communication session to generate one or more input tokens 142 to a generative neural network 102.
[0206] For example, the system 105 can, for each of the one or more audio tokens 140, concatenate the audio token and the output audio token 138 generated by the dialogue system 134 at the preceding time step in the communication session and process the concatenation using a multi-layer perceptron to generate a respective input token 142 to the generative neural network 102.
[0207] In some cases, the multi-layer perceptron is trained jointly with the generative neural network 102. For example, during training of the generative neural network 102, the parameters of the multi-layer perceptron can be updated via backpropagation based on the loss computed for the output of the generative neural network 102 (e.g., a next-token prediction loss).
[0208] In some cases, the input tokens 142 to the generative neural network 102 are vectors in an embedding space. In these cases, to process the input set of tokens, the system 105 generates a respective embedding of each token in the input set of tokens. The system 105 then processes an input that includes the respective embeddings of the tokens in the input set of tokens using an input-output encoder neural network to generate, as output, an input token 142 that is in the embedding space.
[0209] For example, the system 105 can, for each of the one or more audio tokens 140, concatenate the audio token and the output audio token 138 generated by the dialogue system 134 at the preceding time step in the communication session and process the concatenation using an encoder neural network to generate a respective embedding. Then, the system can process the respective embedding using an input-output encoder neural network to generate an updated embedding that is an input token 142 that is in the embedding space.
[0210] In some cases, the input that includes the respective embeddings of the tokens in the input set of tokens further includes the respective input tokens 142 at one or more preceding time steps during the communication session. For example, the input can further include the respective input tokens 142 at all preceding time steps during the communication session.
[0211] In some cases, the input set of tokens further includes audio tokens 140 representing audio received from the user system 136 at one or more preceding time steps during the communication session. For example, the input can further include the audio tokens representing audio received from the user system 136 at all preceding time steps during the communication session.
[0212] In some implementations, to process the input set of tokens, the system 105 generates a sequence of the input set of tokens that interleaves (ii) the output audio token 138 and (i) the one or more audio tokens 140.
[0213] For example, given an output audio token 138 generated by the dialogue system 134 at the preceding time step (denoted as y_{t-1}) for the current time step t and one audio token 140 representing audio received from the user system 136 at the time step t (denoted as x_audio_t), the system 105 can generate the sequence [y_{t-1}, x_audio_t] for each time step.
[0214] In some implementations, the input set of tokens can include (i) the one or more audio tokens 140 and (ii) the output audio token 138 generated by the dialogue system 134 at the preceding time step, and (iii) the one or more video tokens.
[0215] For example, continuing with the previous example and given one video token representing video received from the user system 136 at the time step t (denoted as x_video_t), the system 105 can generate the sequence [y_{t-1}, x_audio_t, x_video_t] for each time step.
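The per-time-step interleaving described above can be illustrated with the following sketch. The function names and the integer token values are hypothetical, and the handling of the very first time step (which has no preceding output token) is not addressed here.

```python
from typing import List, Optional, Sequence, Tuple


def build_step_sequence(prev_output_audio: int,
                        user_audio: Sequence[int],
                        user_video: Optional[Sequence[int]] = None) -> List[int]:
    """Return [y_{t-1}, x_audio_t, (x_video_t)] for one time step."""
    step = [prev_output_audio, *user_audio]
    if user_video:
        step.extend(user_video)
    return step


def build_session_sequence(
    steps: Sequence[Tuple[int, Sequence[int], Optional[Sequence[int]]]]
) -> List[int]:
    """Concatenate the per-step sequences over consecutive time steps."""
    session: List[int] = []
    for prev_output_audio, user_audio, user_video in steps:
        session.extend(build_step_sequence(prev_output_audio, user_audio, user_video))
    return session


# Toy token ids (assumed values): one user audio token and one video token per step.
audio_only = build_step_sequence(501, [12])
audio_video = build_step_sequence(501, [12], [77])
session = build_session_sequence([(501, [12], [77]), (502, [13], [78])])
print(audio_only, audio_video, session)
```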
[0216] The system 105 processes the one or more input tokens 142 using the generative neural network 102 to generate one or more output audio tokens 144 for the time step.
[0217] In some cases, the generative neural network 102 is an auto-regressive generative neural network that auto-regressively generates output tokens.
[0218] In some cases, after the system 105 processes the input token(s) 142 using the generative neural network 102 to generate the output audio token(s) 144 for the time step, the system 105 processes the one or more output audio tokens 144 to generate audio for the time step. The system 105 then provides the audio for the time step for playback at the user system 136. For example, the system 105 converts the generated one or more output audio tokens 144 into an audio waveform and streams it to a user's device for immediate playback.
[0219] FIG. 2A is a flow diagram of an example process 200 for incorporating timing information in speech recognition outputs. For convenience, the process will be described as being performed by a system of one or more computers located in one or more locations. For example, an inference system, e.g., the inference system 100 of FIG. 1A, appropriately programmed in accordance with this specification, can perform the process 200.
[0220] The system receives a network input that includes audio data representing speech spoken during a given time window (step 202).
[0221] As described above, the audio data can be a digital representation of an audio signal, such as a waveform or a sequence of acoustic feature frames (e.g., log-mel filterbank energies) corresponding to specific time intervals (e.g., 40 milliseconds). In some cases, the audio data is formatted to be compatible with an audio encoder, such as a Conformer encoder, which can transform the input frames into a sequence of continuous audio embeddings.
[0222] The system processes the network input using a generative neural network to generate an output sequence of output tokens that represents a transcription of the speech (step 204). In some cases, to initiate this specific task, the generative neural network receives a distinct prompt included in the network input, such as “Transcribe the following audio with word timestamps <audio>”. Each output token is selected from a vocabulary of tokens that includes a plurality of text tokens and a plurality of timing tokens.
[0223] The vocabulary of tokens can be derived from a pre-defined tokenizer associated with the generative neural network. The text tokens can include standard linguistic units like words, sub-words, or characters used for transcription. The timing tokens can be selected from a dedicated set of pre-defined tokens that exist within the tokenizer vocabulary but do not appear in standard speech recognition transcripts.
[0224] In some cases, each timing token represents a respective symbol in an alphabet of symbols in a particular numeral system.
[0225] The particular numeral system can be a base-n system, where n is any positive integer. For example, the particular numeral system can be base-2, base-8, base-10, base-16, base-32, or base-64. Using a higher base, such as base-64, allows the system to reduce the timestamp token count.
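To illustrate why a higher base reduces the timestamp token count, the following sketch converts a frame index into its digits in a chosen base, with one digit corresponding to one timing token. The function name and the example frame index 749 (29.96 seconds at an assumed 40 ms frame duration) are illustrative; the sketch also assumes a base of at least 2, since positional notation is not defined for base 1.

```python
from typing import List


def to_base_digits(value: int, base: int) -> List[int]:
    """Return the digits of a non-negative integer in the given base,
    most significant digit first."""
    if base < 2:
        raise ValueError("this sketch assumes base >= 2")
    if value < 0:
        raise ValueError("value must be non-negative")
    digits: List[int] = []
    while True:
        value, remainder = divmod(value, base)
        digits.append(remainder)
        if value == 0:
            break
    return digits[::-1]


frame_index = 749  # assumed example: 29.96 s at 40 ms per frame
for base in (2, 10, 64):
    digits = to_base_digits(frame_index, base)
    print(base, digits, len(digits))
# base 2 needs 10 digits, base 10 needs 3 digits, base 64 needs 2 digits
```

Under these assumptions, any frame index below 64 * 64 = 4096 fits in at most two base-64 timing tokens.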
[0226] In some cases, the generative neural network includes an auto-regressive neural network that auto-regressively generates tokens from the vocabulary. As described above, the auto-regressive neural network can process the audio data (e.g., as audio embeddings derived from an encoder) and sequentially predict a next token in the output sequence based on the audio embeddings and previously generated tokens in the output sequence.
[0227] In some cases, the generative neural network is configured to include, within the output sequence, a respective contiguous subsequence of timing tokens after each subsequence of text tokens that represent a specific unit of text in a natural language.
[0228] In some cases, the specific unit of text is a word in the natural language. For example, for the English word “Hello”, the system generates the text tokens for “Hello” followed immediately by timing tokens, for example, “<\ctrl36>”, where “<\ctrl36>” is a timing token.
[0229] In some cases, the specific unit of text can be a pause indicator. For example, the system can mark a pause with empty brackets “{}” and append a timestamp to the brackets (e.g., “{} <\ctrl49>”), which can improve the timing token selection accuracy for the following word generated by the system.
[0230] In some implementations, the generative neural network has been trained on training examples that each include (i) audio representing training speech, (ii) a ground truth transcription of the training speech, and (iii) respective timing data that specifies for each of one or more units of text within the ground truth transcription, a respective timestamp at which the unit of text was spoken.
[0231] As described above, to perform this training, the system (or another training system) can utilize datasets containing these training examples and that sometimes include alignment information.
[0232] In some cases, to train the generative neural network on these training examples, the system generates a ground truth output sequence and trains the generative neural network on the ground truth output sequence and the audio in the training example. That is, the system generates, from the ground truth transcription of the training speech and the respective timing data, a ground truth output sequence that includes, for each of the one or more units of text within the ground truth transcription, text tokens representing the unit of text followed by one or more timing tokens representing the respective timestamp at which the unit of text was spoken. For example, if the word “Hello” ends at 1.44 seconds (1440 ms) and the system uses 40 ms frame precision, the system calculates the time step (1440 / 40 = 36) and generates the ground truth sequence “Hello<\ctrl36>”. The system then trains the generative neural network using the audio in the training example and the ground truth output sequence for the training example. For example, the system calculates a loss function (e.g., cross-entropy loss) comparing the generated tokens to the ground truth tokens and updates the weights of the generative neural network via backpropagation and gradient descent to minimize this loss.
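The construction of such a ground truth sequence can be sketched as follows. The example assumes base-64 timing tokens written as <\ctrlN>, a 40 ms frame duration, integer (floor) division of the end time by the frame duration, and a space between consecutive units; the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

FRAME_MS = 40  # assumed frame precision
BASE = 64      # assumed numeral system for timing tokens


def timing_tokens(end_ms: int, frame_ms: int = FRAME_MS, base: int = BASE) -> List[str]:
    """Encode a word end time as one or more timing tokens."""
    frame = end_ms // frame_ms  # assumption: floor to the frame grid
    digits: List[int] = []
    while True:
        frame, remainder = divmod(frame, base)
        digits.append(remainder)
        if frame == 0:
            break
    return [f"<\\ctrl{d}>" for d in reversed(digits)]


def build_target(words: Sequence[Tuple[str, int]]) -> str:
    """Append timing tokens to each word; words are (text, end_ms) pairs."""
    return " ".join(word + "".join(timing_tokens(end_ms)) for word, end_ms in words)


print(build_target([("Hello", 1440)]))                  # Hello<\ctrl36>
print(build_target([("Hello", 1440), ("world", 3040)]))  # Hello<\ctrl36> world<\ctrl1><\ctrl12>
```

The second call shows a timestamp that needs two base-64 digits: 3040 ms corresponds to frame 76, which is 1 * 64 + 12.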
[0233] Further details of training the generative neural network are described below with reference to FIG. 3.
[0234] The system processes the output sequence of output tokens to generate a speech recognition output (step 206).
[0235] To process the output sequence of output tokens to generate a speech recognition output, the system identifies, in the output sequence, a plurality of contiguous subsequences of timing tokens.
[0236] The system then identifies, for each contiguous subsequence of timing tokens, a corresponding respective text unit that is represented by text tokens that precede the contiguous subsequence of timing tokens in the output sequence.
[0237] Afterwards, the system determines, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens.
[0238] In some implementations, to determine the respective timestamp, the system maps each timing token in the subsequence to the respective symbol represented by the timing token. The system then decodes the respective symbols according to the particular numeral system to generate a timestamp.
[0239] For example, given the timing tokens “<\ctrl1>” and “<\ctrl12>” that correspond to the numeral system of base-64, the system can map each timing token to its respective symbol (e.g., the values 1 and 12) and decode the resulting symbols to generate the timestamp (e.g., by calculating [1 * 64 + 12] = 76, and multiplying by a 40ms frame duration to yield 3040ms or 3.04 seconds).
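The decoding arithmetic in this example can be checked with a short sketch. It assumes that each timing token has the textual form <\ctrlN>, where N is the digit value, and that the frame duration is 40 ms; the function name is hypothetical.

```python
import re
from typing import Sequence

# Matches a literal "<\ctrlN>" token and captures the digit value N.
_TIMING_TOKEN = re.compile(r"<\\ctrl(\d+)>")


def decode_timestamp_ms(tokens: Sequence[str], base: int = 64, frame_ms: int = 40) -> int:
    """Decode a contiguous subsequence of timing tokens into milliseconds."""
    value = 0
    for token in tokens:
        match = _TIMING_TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"not a timing token: {token!r}")
        digit = int(match.group(1))
        if digit >= base:
            raise ValueError(f"digit {digit} is out of range for base {base}")
        value = value * base + digit  # positional decoding, most significant first
    return value * frame_ms


print(decode_timestamp_ms(["<\\ctrl1>", "<\\ctrl12>"]))  # 3040
print(decode_timestamp_ms(["<\\ctrl36>"]))               # 1440
```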
[0240] As described above, in some cases, the system determines whether the timestamp is outside of the given time window and, in response to determining that the timestamp is outside the time window, modifies the timestamp to indicate a last timestamp that is within the given time window.
[0241] As described above, in some other cases, the system determines whether the timestamp is earlier than a timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence. Then, in response to determining that the timestamp is earlier than a timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence, the system modifies the timestamp to indicate a timestamp that is no earlier than the timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence.
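The two corrections described above (limiting a timestamp to the time window and keeping timestamps non-decreasing) can be sketched together as follows. Combining them in a single pass, clamping to the window end, and raising an out-of-order timestamp exactly to the preceding one are assumptions of this example; other choices that satisfy the stated conditions are possible.

```python
from typing import List, Sequence


def repair_timestamps(timestamps_ms: Sequence[int], window_end_ms: int) -> List[int]:
    """Clamp each timestamp to the window and enforce non-decreasing order."""
    repaired: List[int] = []
    previous = 0
    for ts in timestamps_ms:
        ts = min(ts, window_end_ms)  # outside the window: use the last time within it
        ts = max(ts, previous)       # earlier than its predecessor: raise to predecessor
        repaired.append(ts)
        previous = ts
    return repaired


# Assumed 30-second window; the second and fourth timestamps are invalid.
print(repair_timestamps([400, 360, 1200, 31000], window_end_ms=30000))
# [400, 400, 1200, 30000]
```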
[0242] The system then provides, as the speech recognition output, (i) the text represented by the tokens in the subsequence and (ii) data that specifies that, for each contiguous subsequence of timing tokens, the corresponding text unit for the contiguous subsequence of timing tokens was spoken at the respective timestamp represented by the contiguous subsequence of timing tokens.
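Step 206 as a whole can be illustrated with the sketch below, which scans an output sequence, groups the text tokens that precede each contiguous run of timing tokens, and pairs the resulting text unit with its decoded timestamp. The token spelling <\ctrlN>, the base-64 and 40 ms settings, and the decision to drop trailing text that has no timing tokens are assumptions of this example.

```python
import re
from typing import List, Sequence, Tuple

_TIMING_TOKEN = re.compile(r"<\\ctrl(\d+)>")


def parse_output(tokens: Sequence[str], base: int = 64,
                 frame_ms: int = 40) -> List[Tuple[str, int]]:
    """Return (text unit, timestamp in ms) pairs from an output token sequence."""
    results: List[Tuple[str, int]] = []
    text_parts: List[str] = []
    digits: List[int] = []

    def flush() -> None:
        # Close the current unit once its timing subsequence has ended.
        if digits:
            value = 0
            for digit in digits:
                value = value * base + digit
            results.append(("".join(text_parts).strip(), value * frame_ms))
            text_parts.clear()
            digits.clear()

    for token in tokens:
        match = _TIMING_TOKEN.fullmatch(token)
        if match:
            digits.append(int(match.group(1)))
        else:
            flush()  # a text token after timing tokens starts a new unit
            text_parts.append(token)
    flush()
    return results


output_tokens = ["Hel", "lo", "<\\ctrl36>", " world", "<\\ctrl1>", "<\\ctrl12>"]
print(parse_output(output_tokens))  # [('Hello', 1440), ('world', 3040)]
```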
[0243] FIG. 2B is a flow diagram of an example process 208 for training a generative neural network to generate output that interleaves audio and text tokens, where text tokens precede audio tokens. For convenience, the process 208 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 101 of FIG. 1B, appropriately programmed in accordance with this specification, can perform the process 208.
[0244] The system obtains a training example that includes (i) a training input and (ii) a target output that includes an audio sequence (step 210).
[0245] As described above, the training input can be data serving as a prompt or context for the generative neural network, such as a user query or a specific instruction.
[0246] As described above, the audio sequence can be a waveform or a stream of audio data.
[0247] The system identifies a partitioning of the audio sequence into a plurality of audio segments (step 212).
[0248] As described above, the system can utilize alignment information (e.g., timestamps) included in the training example to determine the start time and end time for dividing the audio sequence into distinct segments.
[0249] In some cases, one or more of the plurality of audio segments each correspond to speech of a respective word, and the text transcription of the audio segment is a transcription of the respective word.
[0250] For example, the system segments the audio sequence representing the phrase “My daughter” such that one segment corresponds exactly to the waveform of the spoken word “My” and the next segment corresponds to the waveform of the spoken word “daughter”.
[0251] In some cases, one or more of the plurality of audio segments each correspond to speech of a respective chunk of multiple words, and the text transcription of the audio segment is a transcription of the multiple words.
[0252] For example, the system partitions the audio sequence into chunks comprising a couple of words at a time (e.g., “My daughter”) rather than word-by-word.
[0253] The system, for each audio segment, identifies a text transcription of the audio segment (step 214).
[0254] As described above, the system can map the temporal boundaries of the audio segment to the corresponding text transcript included in the training example to extract the specific text spoken during that interval.
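One way to picture this mapping is the following sketch, which selects the aligned words whose time intervals fall inside a segment's boundaries. The alignment format (word, start in seconds, end in seconds), the containment rule, and the end time used for "daughter" are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Each alignment entry is (word, start_s, end_s); the values are illustrative.
Alignment = Sequence[Tuple[str, float, float]]


def transcript_for_segment(alignment: Alignment, seg_start_s: float,
                           seg_end_s: float) -> str:
    """Return the words whose intervals lie within [seg_start_s, seg_end_s]."""
    words: List[str] = [
        word for word, start_s, end_s in alignment
        if start_s >= seg_start_s and end_s <= seg_end_s
    ]
    return " ".join(words)


alignment = [("My", 0.0, 0.5), ("daughter", 0.5, 1.1)]

print(transcript_for_segment(alignment, 0.0, 0.5))  # My
print(transcript_for_segment(alignment, 0.5, 1.1))  # daughter
print(transcript_for_segment(alignment, 0.0, 1.1))  # My daughter
```

The first two calls correspond to word-level segments and the third to a multi-word chunk.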
[0255] The system generates a ground truth output sequence that includes, for each audio segment, (i) a respective set of audio tokens representing the audio segment and (ii) a respective set of text tokens representing the text transcription of the audio segment (step 216). The respective set of text tokens precedes the respective set of audio tokens in the ground truth output sequence.
[0256] In some cases, each respective set of audio tokens is immediately preceded by a start of audio token in the ground truth output sequence. In some cases, each respective set of audio tokens is immediately followed by an end of audio token in the ground truth output sequence.
[0257] In some implementations, the system generates the respective sets of audio tokens representing the audio segments by applying an audio tokenizer to the audio sequence. Examples of audio encoders include Conformer encoders configured to convert audio waveforms into discrete hard tokens. For example, the system can process the waveform for the word “daughter” through the audio tokenizer to produce corresponding discrete audio tokens by slicing the continuous waveform into fixed time steps (e.g., frames) and mapping the acoustic features of each frame to a specific integer code from a learned codebook.
[0258] In some implementations, the system generates the respective set of text tokens by applying a text tokenizer to the transcription of the audio segment. Examples of text tokenizers include SentencePiece or Byte-Pair Encoding (BPE) tokenizers. For example, the system can process the text string “daughter” through the text tokenizer to produce corresponding discrete text tokens by looking up the word (or its sub-word constituents) in a pre-defined vocabulary and assigning it the corresponding unique integer identifier used by the model.
[0259] The system trains the generative neural network using the training input and the ground truth output sequence (step 218).
[0260] As described above, in some cases, training the generative neural network includes training it on a next token prediction objective. For example, the system calculates a loss function (e.g., cross-entropy loss) comparing the generated tokens to the ground truth tokens and updates the weights of the generative neural network via backpropagation and gradient descent to minimize this loss.
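The cross-entropy loss for a next token prediction objective can be written out as in the sketch below. It computes only the loss value from per-position logits and ground truth token ids; the gradient computation and weight update would normally be handled by an automatic differentiation framework and are not shown. The function name and toy inputs are assumptions.

```python
import math
from typing import Sequence


def next_token_loss(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean cross-entropy between predicted distributions and ground truth tokens.

    logits[i] holds the unnormalized scores over the vocabulary for position i,
    and targets[i] is the ground truth token id at that position.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the maximum for numerical stability
        log_partition = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_partition - row[target]  # equals -log softmax(row)[target]
    return total / len(targets)


# Uniform scores over a 4-token vocabulary give a loss of log(4).
uniform_loss = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [2])
# A confident, correct prediction gives a loss close to zero.
confident_loss = next_token_loss([[0.0, 0.0, 20.0, 0.0]], [2])
print(uniform_loss, confident_loss)
```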
[0261] Further details of training the generative neural network are described below with reference to FIG. 3.
[0262] FIG. 2C is a flow diagram of an example process 220 for training a generative neural network to generate output that interleaves audio and text tokens, where audio tokens precede text tokens. For convenience, the process 220 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 103 of FIG. 1C, appropriately programmed in accordance with this specification, can perform the process 220.
[0263] The system obtains a training example that includes (i) a training input and (ii) a target text output (step 222).
[0264] As described above, the training input can be data serving as a prompt or context for the generative neural network, such as an audio stream from a user (e.g., spoken input) that the model must transcribe or respond to.
[0265] As described above, the target text output can be a textual transcription of the audio contained within the training example.
[0266] The system identifies a partitioning of the target text output into a plurality of text segments (step 224).
[0267] As described above, the system can utilize alignment information (e.g., timestamps) included in the training example that maps specific time intervals of the audio to corresponding units of text to determine the boundaries for dividing the target text output into distinct segments, such as individual words or phrases.
[0268] The system, for each text segment, identifies an audio segment that represents a verbalization of the text segment (step 226).
[0269] As described above, the system can use the start and end timestamps associated with the text segment to locate and extract the corresponding slice of the audio waveform from the training input.
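Extracting the slice of a waveform for a text segment can be sketched as follows. The 16 kHz sample rate, the representation of the waveform as a plain list of samples, and the rounding of boundaries to the nearest sample are assumptions of this example.

```python
from typing import List, Sequence


def slice_waveform(samples: Sequence[float], start_s: float, end_s: float,
                   sample_rate: int = 16000) -> List[float]:
    """Return the samples between start_s and end_s (in seconds)."""
    start_index = int(round(start_s * sample_rate))
    end_index = int(round(end_s * sample_rate))
    return list(samples[start_index:end_index])


# One second of silence at the assumed sample rate.
waveform = [0.0] * 16000

# Segment for a word aligned to the interval 0.0 s to 0.5 s.
segment = slice_waveform(waveform, 0.0, 0.5)
print(len(segment))  # 8000
```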
[0270] In some cases, one or more of the plurality of audio segments correspond to speech of a respective word. For example, the system identifies an audio segment corresponding to the word “My” of a text segment based on the alignment data indicating the specific time interval (e.g., 0.0s to 0.5s) during which “My” was spoken.
[0271] In some cases, one or more of the plurality of audio segments correspond to speech of a respective chunk of multiple words, and the text transcription of the audio segment is a transcription of the multiple words. For example, the system identifies an audio segment corresponding to the phrase “My daughter” of a text segment based on the alignment data indicating the specific time interval during which “My daughter” was spoken.
[0272] The system generates a ground truth output sequence that includes, for each text segment, (i) a respective set of text tokens representing the text segment and (ii) a respective set of audio tokens representing the corresponding audio segment (step 228). The respective set of audio tokens precedes the respective set of text tokens in the ground truth output sequence.
[0273] In some cases, each respective set of audio tokens is immediately preceded by a start of audio token in the ground truth output sequence. In some cases, each respective set of audio tokens is immediately followed by an end of audio token in the ground truth output sequence.
[0274] In some cases, the system generates the respective sets of audio tokens representing the audio segments by applying an audio tokenizer to the audio sequence. Examples of audio encoders include Conformer encoders configured to convert audio waveforms into discrete hard tokens. For example, the system can process the waveform for the word “daughter” through the audio tokenizer to produce corresponding discrete audio tokens by slicing the continuous waveform into fixed time steps (e.g., frames) and mapping the acoustic features of each frame to a specific integer code from a learned codebook.
[0275] In some implementations, the system generates the respective set of text tokens by applying a text tokenizer to the transcription of the audio segment. Examples of text tokenizers include SentencePiece or Byte-Pair Encoding (BPE) tokenizers. For example, the system can process the text string “daughter” through the text tokenizer to produce corresponding discrete text tokens by looking up the word (or its sub-word constituents) in a pre-defined vocabulary and assigning it the corresponding unique integer identifier used by the model.
[0276] The system trains the generative neural network using the training input and the ground truth output sequence (step 230).
[0277] As described above, in some cases, training the generative neural network includes training it on a next token prediction objective. For example, the system calculates a loss function (e.g., cross-entropy loss) comparing the generated tokens to the ground truth tokens and updates the weights of the generative neural network via backpropagation and gradient descent to minimize this loss.
[0278] Further details of training the generative neural network are described below with reference to FIG. 3.
[0279] FIG. 2D is a flow diagram of an example process 232 for formatting inputs to the generative neural networks during dialogue sessions that include received audio (and optionally video) and generated audio at the same time step. For convenience, the process 232 will be described as being performed by a system of one or more computers located in one or more locations. For example, an inference system, e.g., the inference system 105 of FIG. 1D, appropriately programmed in accordance with this specification, can perform the process 232.
[0280] The system initializes a communication session between a dialogue system and a user system (step 234).
[0281] As described above, the communication session between a dialogue system and a user system can include various pairs of interacting systems, such as a cloud-based AI assistant communicating with a user’s smartphone, an automated customer service agent communicating with a user’s telephone, or an on-device virtual assistant communicating via the microphone and speakers of a smart home device.
[0282] The system, at each of a plurality of time steps during the communication session, performs the steps 236-240.
[0283] The system obtains one or more audio tokens representing audio received from the user system at the time step (step 236).
[0284] As an example, the system can receive an audio waveform from the user system’s microphone and process it through an audio tokenizer to convert the continuous waveform into audio tokens.
[0285] In some implementations, at each of the plurality of time steps during the communication session, the system can obtain one or more video tokens representing video received from the user system at the time step.
[0286] As an example, the system can receive a video frame captured by the user’s camera, process it through a video tokenizer (e.g., a Vision Transformer-based tokenizer), and convert the visual data into a sequence of video tokens.
[0287] The system processes an input set of tokens that includes (i) the one or more audio tokens and (ii) an output audio token generated by the dialogue system at the preceding time step in the communication session to generate one or more input tokens to a generative neural network (step 238).
[0288] As described above, in some cases, the input tokens to the generative neural network are vectors in an embedding space. In these cases, to process the input set of tokens, the system generates a respective embedding of each token in the input set of tokens. The system then processes an input that includes the respective embeddings of the tokens in the input set of tokens using an input-output encoder neural network to generate, as output, an input token that is in the embedding space.
[0289] The input-output encoder neural network can be a function f, such as a Multi-Layer Perceptron (e.g., a 2-layer MLP), configured to map the concatenated embeddings of the distinct tokens into a single vector representation suitable for the generative neural network.
[0290] As an example, for a given time step t, the system identifies the output audio token generated by the dialogue system at the preceding time step (y_{t-1}), the audio token received from the user at the current time step (x_audio_t), and the video token received from the user at the current time step (x_video_t). The system embeds and concatenates these tokens and processes them through the encoder f to generate the single input token x_t according to the formulation: x_t = f(embed_and_concat(y_{t-1}, x_audio_t, x_video_t)).
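A minimal sketch of this formulation follows, assuming a shared embedding table, a 2-layer MLP without bias terms, and a tanh non-linearity. The dimensions, the random initialization, the activation, and all names other than f and embed_and_concat are illustrative assumptions; the specification does not fix them.

```python
import math
import random

EMBED_DIM = 4    # hypothetical embedding width per token
HIDDEN_DIM = 8   # hypothetical hidden width of the MLP
MODEL_DIM = 6    # hypothetical width of the generative network's input space
VOCAB_SIZE = 16  # hypothetical size of the shared token vocabulary

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Small random weights; a trained system would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


embedding_table = rand_matrix(VOCAB_SIZE, EMBED_DIM)
w1 = rand_matrix(3 * EMBED_DIM, HIDDEN_DIM)  # first MLP layer
w2 = rand_matrix(HIDDEN_DIM, MODEL_DIM)      # second MLP layer


def matvec(vec, mat):
    """Row vector times matrix."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]


def embed_and_concat(*token_ids):
    """Look up each token's embedding and concatenate them into one vector."""
    out = []
    for token_id in token_ids:
        out.extend(embedding_table[token_id])
    return out


def f(concatenated):
    """2-layer MLP (tanh is an assumed activation) mapping to the model's input space."""
    hidden = [math.tanh(h) for h in matvec(concatenated, w1)]
    return matvec(hidden, w2)


def input_token(y_prev, x_audio_t, x_video_t):
    """x_t = f(embed_and_concat(y_{t-1}, x_audio_t, x_video_t))."""
    return f(embed_and_concat(y_prev, x_audio_t, x_video_t))


x_t = input_token(y_prev=3, x_audio_t=7, x_video_t=12)
print(len(x_t))  # one vector of width MODEL_DIM per time step
```

The point of the design is that three discrete tokens per time step collapse into one embedding-space input token, so the generative neural network sees a single position per time step.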
[0291] In some cases, the input that includes the respective embeddings of the tokens in the input set of tokens further includes the respective input tokens at one or more preceding time steps during the communication session. For example, the input can further include the respective input tokens at all preceding time steps during the communication session.
[0292] As an example, the input to the encoder network can also include the sequence of input tokens generated for all previous time steps. In this example, the system generates the current input token x_t by processing the current embeddings alongside the history according to the formulation: x_t = f(embed_and_concat(y_{t-1}, x_audio_t, x_video_t, x_{1..t-1})).
[0293] In some cases, the input set of tokens further includes audio tokens representing audio received from the user system at one or more preceding time steps during the communication session. For example, the input set processed by the system can include the audio tokens representing audio received from the user system at all preceding time steps during the communication session (e.g., x_audio_{1..t-1}).
[0294] In some implementations, processing the input set of tokens comprises processing the one or more audio tokens and the output audio token generated by the dialogue system at the preceding time step using a token mixing neural network to generate a combined token representing a combination of the audio received from the user system and the audio generated by the dialogue system. For example, in this approach, the system can utilize a small neural network (e.g., a Multi-Layer Perceptron (MLP)) to map the distinct audio tokens from the user (x_audio_t) and the system (y_audio_t) into a single token representing the combined signal (e.g., conceptually mapping tokens for x and y to a token for x+y).
[0295] In some cases, this small neural network is trained jointly with the generative neural network. For example, the parameters of the small neural network can be updated via backpropagation based on the objective loss function of the generative neural network.
[0296] In some implementations, to process the input set of tokens, the system generates a sequence of the input set of tokens that interleaves the (ii) output audio token and the (i) one or more audio tokens.
[0297] For example, given an output audio token generated by the dialogue system at the preceding time step (denoted as y_{t-1}) for the current time step t and one audio token representing audio received from the user system at the time step t (denoted as x_audio_t), the system can generate the sequence [y_{t-1}, x_audio_t] for each time step.
[0298] In some implementations, the input set of tokens can include (i) the one or more audio tokens, (ii) the output audio token generated by the dialogue system at the preceding time step, and (iii) one or more video tokens. The one or more video tokens can be the video tokens that the system obtains, at each of the plurality of time steps during the communication session, representing video received from the user system at the time step.
[0299] For example, continuing with the previous example, if the user system transmits video frames alongside audio, the system can generate a sequence that interleaves all three modalities. Given the output audio token generated by the dialogue system at the preceding time step (denoted as y_{t-1}), the audio token received from the user at time t (denoted as x_audio_t), and the set of video tokens representing the video frame captured at time t (denoted as x_video_t), the system can generate a sequence [y_{t-1}, x_audio_t, x_video_t] for each time step.
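The interleaved formatting can be sketched as follows. The helper name `interleave`, the string token labels, and the use of a `<start>` placeholder for the first time step (when no preceding output exists) are assumptions made for illustration.

```python
def interleave(prev_outputs, user_audio, user_video=None):
    """Flatten per-step groups [y_{t-1}, x_audio_t, x_video_t...] into one sequence.

    prev_outputs[t] holds the output audio token from the preceding step (y_{t-1}),
    user_audio[t] holds x_audio_t, and user_video[t] holds zero or more video tokens.
    """
    sequence = []
    for t in range(len(user_audio)):
        sequence.append(prev_outputs[t])    # y_{t-1}
        sequence.append(user_audio[t])      # x_audio_t
        if user_video is not None:
            sequence.extend(user_video[t])  # x_video_t (may be several tokens)
    return sequence


audio_only = interleave(["<start>", "y1"], ["a1", "a2"])
with_video = interleave(["<start>", "y1", "y2"],
                        ["a1", "a2", "a3"],
                        [["v1a", "v1b"], ["v2a"], []])
print(audio_only)
print(with_video)
```

Unlike the encoder-based approach, this keeps every token as its own position, at the cost of a longer input sequence per time step.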
[0300] The system processes the one or more input tokens using the generative neural network to generate one or more output audio tokens for the time step (step 240).
[0301] In some cases, the generative neural network is an auto-regressive generative neural network that auto-regressively generates output tokens.
[0302] In some cases, after the system processes the input token(s) using the generative neural network to generate the output audio token(s) for the time step, the system processes the one or more output audio tokens to generate audio for the time step. The system then provides the audio for the time step for playback at the user system. For example, the system converts the generated one or more output audio tokens into an audio waveform (e.g., using a decoder associated with the audio tokenizer) and streams it to a user’s device for immediate playback.
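Steps 236 through 240 can be pictured as a per-time-step loop. The sketch below only shows the control flow: `tokenize_audio`, `generate_next`, and `detokenize_audio` are hypothetical stand-ins for the audio tokenizer, the generative neural network, and the tokenizer's decoder, and their arithmetic is arbitrary.

```python
def tokenize_audio(chunk):
    """Hypothetical audio tokenizer: one integer token per waveform chunk."""
    return int(sum(abs(sample) for sample in chunk) * 10) % 256


def generate_next(history):
    """Stand-in for the generative neural network (step 240)."""
    return (sum(history) + 1) % 256


def detokenize_audio(token):
    """Hypothetical decoder paired with the audio tokenizer."""
    return [token / 256.0] * 4


def run_session(incoming_chunks, start_token=0):
    history = []          # interleaved input tokens seen so far in the session
    y_prev = start_token  # output audio token from the preceding time step
    played = []
    for chunk in incoming_chunks:
        x_audio = tokenize_audio(chunk)          # step 236: obtain user audio token
        history.extend([y_prev, x_audio])        # step 238: format the model input
        y_prev = generate_next(history)          # step 240: generate output audio token
        played.append(detokenize_audio(y_prev))  # audio provided for playback
    return played


playback = run_session([[0.1, 0.2], [0.0, 0.3], [0.5, 0.5]])
print(len(playback))  # one playback chunk per time step
```

Each generated token is fed back as y_{t-1} at the next step, which is what lets the model condition on its own speech and the user's speech at the same time.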
[0303] FIG. 3 is a flow diagram of an example process 300 for training a generative neural network. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the inference system 100 of FIG. 1A, training system of FIG. 1B, or training system of FIG. 1C, appropriately programmed in accordance with this specification, can perform the process 300.
[0304] The system receives a training input and ground truth output sequence (step 302). As described above, the specific composition of the training input and ground truth output sequence depends on the specific training task.
[0305] For example, in the context of the system of FIG. 1A, the training input includes audio data representing speech (e.g., audio tokens). The ground truth output sequence is formed by appending timing tokens to the text tokens of the transcription, such that each unit of text (e.g., a word) is immediately followed by one or more timing tokens representing the specific timestamp at which that unit was spoken.
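One way to picture such a ground truth output sequence is sketched below. It assumes timestamps are quantized to 40 ms steps and written as a fixed number of base-64 digits, one timing token per digit. The fixed width of two digits, the `<tN>` token spelling, and the whole-word text tokenizer are illustrative assumptions.

```python
BASE = 64       # numeral system base (base-64 is one option named in the claims)
STEP_MS = 40    # assumed quantization step in milliseconds
NUM_DIGITS = 2  # assumed fixed number of timing tokens per timestamp


def timestamp_to_timing_tokens(ms):
    """Quantize a timestamp and write it as NUM_DIGITS base-BASE digits."""
    index = round(ms / STEP_MS)
    if not 0 <= index < BASE ** NUM_DIGITS:
        raise ValueError("timestamp does not fit in the assumed number of digits")
    digits = []
    for _ in range(NUM_DIGITS):
        index, digit = divmod(index, BASE)
        digits.append(digit)
    # Most significant digit first; "<tN>" is a hypothetical token spelling.
    return [f"<t{d}>" for d in reversed(digits)]


def build_ground_truth(words_with_times, text_tokenizer=lambda word: [word]):
    """Append timing tokens immediately after the text tokens of each word."""
    sequence = []
    for word, ms in words_with_times:
        sequence.extend(text_tokenizer(word))
        sequence.extend(timestamp_to_timing_tokens(ms))
    return sequence


print(build_ground_truth([("my", 320), ("daughter", 3000)]))
# ['my', '<t0>', '<t8>', 'daughter', '<t1>', '<t11>']
```

Under these assumptions, two timing tokens cover 64 × 64 × 40 ms, roughly 164 seconds of audio, which is why a high base keeps the timestamp representation compact.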
[0306] As another example, in the context of the systems of FIG. 1B and FIG. 1C, the training input includes a context or prompt. For the system of FIG. 1B, the ground truth output sequence is structured such that text tokens precede their corresponding audio tokens. Conversely, for the system of FIG. 1C, the ground truth output sequence is structured such that audio tokens precede their corresponding text tokens.
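The two orderings can be sketched with one helper that takes per-segment text and audio tokens. The marker spellings `<soa>` and `<eoa>` are hypothetical names for the start-of-audio and end-of-audio tokens mentioned in the claims, and the token labels are placeholders.

```python
SOA, EOA = "<soa>", "<eoa>"  # hypothetical start/end-of-audio marker tokens


def build_sequence(segments, text_first=True, use_markers=False):
    """Build a ground truth sequence from (text_tokens, audio_tokens) pairs.

    text_first=True puts each segment's text before its audio;
    text_first=False puts each segment's audio before its text.
    """
    sequence = []
    for text_tokens, audio_tokens in segments:
        audio = list(audio_tokens)
        if use_markers:
            audio = [SOA] + audio + [EOA]
        parts = (text_tokens, audio) if text_first else (audio, text_tokens)
        for part in parts:
            sequence.extend(part)
    return sequence


segments = [(["hel", "lo"], ["a1", "a2"]), (["world"], ["a3"])]
print(build_sequence(segments, text_first=True))
print(build_sequence(segments, text_first=False))
print(build_sequence(segments, text_first=True, use_markers=True))
```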
[0307] The system processes the training input with the generative neural network (step 304). In some implementations, the system processes the training input (and potentially the ground truth output sequence using teacher forcing) to generate a sequence of probability distributions over a vocabulary of tokens. For example, for each position in the output sequence, the generative neural network predicts the likelihood of each token in the vocabulary (including text, audio, and timing tokens) being the next token, conditioned on the training input and the preceding tokens.
[0308] In implementations where the system utilizes an additional neural network (e.g., the small neural network, such as the multi-layer perceptron (MLP), described above) to mix or project tokens, processing the training input includes processing input tokens (e.g., the user audio token and the system audio token) using the additional neural network to generate input(s) that are further processed by the generative neural network.
[0309] The system computes the objective function (step 306). The system can calculate an objective function, such as a next token prediction objective (or cross-entropy loss), that measures the difference between the probability distributions generated by the network and the actual tokens in the ground truth output sequence. For example, the system compares the predicted probability of a specific token against the actual occurrence of that token in the ground truth output sequence.
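As a minimal sketch of this objective, the cross-entropy over a sequence is the mean negative log-probability that the network assigns to each ground truth token. The function names and the example logits below are illustrative; a real system would compute this inside a machine learning framework.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, target_ids):
    """Mean negative log-likelihood of the ground truth tokens."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        total += -math.log(probs[target])
    return total / len(target_ids)


uniform = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [2])
confident = next_token_loss([[0.0, 0.0, 5.0, 0.0]], [2])
print(uniform, confident)  # the confident, correct prediction has the lower loss
```

A uniform prediction over four tokens gives a loss of ln 4, and the loss approaches zero as the probability of the correct token approaches one.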
[0310] The system updates network parameters (step 308). The system updates the parameters of the generative neural network to minimize the computed objective function. For example, the system can compute gradients of the loss function with respect to the network parameters using backpropagation and apply an optimization algorithm (e.g., a gradient-descent-based method such as stochastic gradient descent or Adam) to adjust the parameters, thereby improving the generative neural network’s ability to generate the correct sequence of text, timing, and audio tokens (as is appropriate for the training task) in future inferences.
[0311] In implementations utilizing the additional neural network when processing inputs, for example, a small neural network (e.g., the MLP) for token mixing or projection, the system updates the parameters of the additional neural network jointly with the parameters of the generative neural network. For example, the system can compute gradients of the objective function that flow back through the generative neural network and into the additional neural network, updating the weights of the additional neural network.
[0312] The system can repeat steps 302 through 308 for multiple batches of training examples until a stopping criterion is satisfied.
[0313] The stopping criterion can be, for example, a determination that the system has performed the training for a predetermined number of training steps. As another example, the stopping criterion can be a determination that the generative neural network has exceeded a predefined performance threshold. In some cases, the stopping criterion can be a determination that the computed objective function has converged, indicating that further training is unlikely to yield significant performance improvements.
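The loop over steps 302 through 308 and the three stopping criteria can be sketched on a toy problem. Here a table of bigram logits stands in for the generative neural network, plain gradient descent stands in for the optimizer, and a loss threshold is used as a proxy for the performance threshold. All names and hyperparameter values are assumptions for illustration.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(sequence, vocab_size, lr=0.5, max_steps=500, target_loss=0.05, tol=1e-6):
    # logits[prev][nxt]: a toy bigram table standing in for the generative model.
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(sequence[:-1], sequence[1:]))
    prev_loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = 0.0
        grads = [[0.0] * vocab_size for _ in range(vocab_size)]
        for prev, nxt in pairs:
            probs = softmax(logits[prev])               # step 304: predict distribution
            loss += -math.log(probs[nxt]) / len(pairs)  # step 306: cross-entropy
            for k in range(vocab_size):
                # d(cross-entropy)/d(logit_k) = p_k - 1[k == target]
                grads[prev][k] += (probs[k] - (1.0 if k == nxt else 0.0)) / len(pairs)
        for i in range(vocab_size):                     # step 308: gradient descent update
            for k in range(vocab_size):
                logits[i][k] -= lr * grads[i][k]
        if loss < target_loss:
            return logits, step, "performance threshold"
        if prev_loss - loss < tol:
            return logits, step, "converged"
        prev_loss = loss
    return logits, max_steps, "max steps"


table, steps, reason = train([0, 1, 2, 0, 1, 2, 0, 1, 2], vocab_size=3)
print(steps, reason)
```

On this repeating sequence the loss falls steadily, so the threshold criterion fires well before the step budget is exhausted; lowering `max_steps` makes the step-count criterion fire instead.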
[0314] FIG. 4 shows examples 400 and 402 of the performance of the described techniques that incorporate timing information in speech recognition outputs.
[0315] Example 400 is a table that compares the Word Error Rate (WER) of a production automatic speech recognition system (“USM RNN-T (prod)”) against the system utilizing the described techniques on two different test datasets. The first row displays results for “YT en_us sets,” where the production system and the described techniques both achieved a WER of 9.80%. The second row displays results for a “YT 72 lang set” (a dataset covering 72 languages), where the production system achieved a WER of 26.70% and the described techniques achieved a comparable WER of 27.20%.
[0316] Example 400 shows that the described techniques can achieve transcription accuracy (WER) that matches or is comparable to existing production baselines, even while performing the additional task of generating timestamps. This positive result of maintaining high transcription fidelity is enabled by the described techniques’ generative neural network generating an output sequence of output tokens selected from a vocabulary that includes a plurality of text tokens and a plurality of timing tokens. By training the model to predict timing tokens auto-regressively alongside text tokens, the described techniques preserve the semantic accuracy of the text tokens representing spoken words while simultaneously embedding the necessary timing information.
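For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below uses that conventional definition; the specification does not describe the scoring procedure behind the reported figures, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words consumed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,                         # deletion
                          curr[j - 1] + 1,                     # insertion
                          prev[j - 1] + (ref_word != hyp_word))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(100 * wer, 2))  # one substitution and one deletion over six words
```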
[0317] Example 402 is a table that compares the Word Timing Error Rate (WTER) error percentiles between the production system (“USM RNN-T (prod)”) and the described techniques. The table reports errors in milliseconds (ms) at the 50th, 95th, and 99th percentiles. For the 50th percentile (median) error, the production system’s result is not reported, while the described techniques achieved an error of 11 ms. For the 95th percentile, the production system showed an error of 240 ms, whereas the described techniques achieved a significantly lower error of 49 ms. For the 99th percentile, the production system’s result is not reported, while the described techniques achieved an error of 120 ms.
[0318] Example 402 shows that the described techniques significantly outperform the production baseline in terms of timing precision, particularly at the 95th percentile, where the error is reduced by nearly a factor of five (from 240 ms to 49 ms). By explicitly generating discrete timing tokens within the output sequence that map to specific time values (e.g., at 40 ms precision), the described techniques achieve finer temporal resolution and tighter alignment with the speech than the baseline technique.
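To make the mapping from timing tokens to time values concrete, the sketch below decodes an output sequence into (text unit, timestamp) pairs. It assumes base-64 digits at 40 ms per unit and a hypothetical `<tN>` token spelling, and it applies two corrections recited in the claims: a timestamp outside the time window is moved to the last timestamp within the window, and a timestamp earlier than its predecessor is raised to the predecessor's value.

```python
BASE = 64     # assumed numeral system base
STEP_MS = 40  # assumed time value of one unit, in milliseconds


def is_timing(token):
    """Hypothetical spelling: timing tokens look like '<t17>'."""
    return token.startswith("<t") and token.endswith(">")


def decode(output_tokens, window_ms):
    """Return (text_unit, timestamp_ms) pairs from an interleaved output sequence."""
    results, text, digits = [], [], []
    last_ms = 0

    def flush():
        nonlocal text, digits, last_ms
        index = 0
        for digit in digits:       # positional decoding, most significant digit first
            index = index * BASE + digit
        ms = index * STEP_MS
        ms = min(ms, window_ms)    # keep the timestamp inside the time window
        ms = max(ms, last_ms)      # keep timestamps non-decreasing
        results.append(("".join(text), ms))
        last_ms, text, digits = ms, [], []

    for token in output_tokens:
        if is_timing(token):
            digits.append(int(token[2:-1]))
        else:
            if digits:             # a text token after timing tokens starts a new unit
                flush()
            text.append(token)
    if digits:
        flush()
    return results


tokens = ["my", "<t0>", "<t8>", "daugh", "ter", "<t1>", "<t11>",
          "is", "<t0>", "<t5>", "here", "<t63>", "<t63>"]
print(decode(tokens, window_ms=4000))
# [('my', 320), ('daughter', 3000), ('is', 3000), ('here', 4000)]
```

In the example, the third timestamp decodes to 200 ms and is raised to 3000 ms to preserve ordering, and the fourth decodes far beyond the window and is clamped to 4000 ms.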
[0319] This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
[0320] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, e.g., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
[0321] The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[0322] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
[0323] In this specification, the term "database" is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which may be organized and accessed differently.
[0324] Similarly, in this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
[0325] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
[0326] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
[0327] Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
[0328] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
[0329] Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, e.g., inference, workloads.
[0330] Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a Jax framework.
[0331] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
[0332] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network.
[0333] The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
[0334] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
[0335] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0336] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
[0337] What is claimed is:
Claims
CLAIMS
1. A method performed by a set of one or more computers, the method comprising: receiving a network input that comprises audio data representing speech spoken during a given time window; processing the network input using a generative neural network to generate an output sequence of output tokens that represents a transcription of the speech, wherein each output token is selected from a vocabulary of tokens that includes a plurality of text tokens and a plurality of timing tokens; processing the output sequence of output tokens to generate a speech recognition output, comprising: identifying, in the output sequence, a plurality of contiguous subsequences of timing tokens; identifying, for each contiguous subsequence of timing tokens, a corresponding respective text unit that is represented by text tokens that precede the contiguous subsequence of timing tokens in the output sequence; determining, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens; and providing, as the speech recognition output, (i) the text represented by the tokens in the subsequence and (ii) data that specifies that, for each contiguous subsequence of timing tokens, the corresponding text unit for the contiguous subsequence of timing tokens was spoken at the respective timestamp represented by the contiguous subsequence of timing tokens.
2. The method of claim 1, wherein the generative neural network is configured to include, within the output sequence, a respective contiguous subsequence of timing tokens after each subsequence of text tokens that represent a specific unit of text in a natural language.
3. The method of claim 2, wherein the specific unit of text is a word in the natural language.
4. The method of any one of claims 2 or 3, wherein the generative neural network has been trained on training examples that each include (i) audio representing training speech, (ii) a ground truth transcription of the training speech, and (iii) respective timing data that specifies for each of one or more units of text within the ground truth transcription, a respective timestamp at which the unit of text was spoken.
5. The method of claim 4, wherein training the generative neural network on the training examples comprises, for each training example: generating, from the ground truth transcription of training speech and the respective timing data, a ground truth output sequence that includes, for each of the one or more units of text within the ground truth transcription, text tokens representing the unit of text followed by one or more timing tokens representing the respective timestamp at which the unit of text was spoken; and training the generative neural network using the audio in the training example and the ground truth output sequence for the training example.
6. The method of any preceding claim, wherein each timing token represents a respective symbol in an alphabet of symbols in a particular numeral system.
7. The method of claim 6, wherein determining, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens comprises: mapping each timing token in the subsequence to the respective symbol represented by the timing token; and decoding the respective symbols according to the particular numeral system to generate a timestamp.
8. The method of any one of claims 6-7, wherein the particular numeral system is base-64.
9. The method of any one of claims 6-8, wherein the particular numeral system is base-
10. The method of any one of claims 7-9, wherein determining, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens further comprises: determining whether the timestamp is outside of the given time window; and in response to determining that the timestamp is outside the time window, modifying the timestamp to indicate a last timestamp that is within the given time window.
11. The method of any one of claims 7-10, wherein determining, for each contiguous subsequence of timing tokens, a respective timestamp within the given time window that is represented by the contiguous subsequence of timing tokens further comprises: determining whether the timestamp is earlier than a timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence; and in response to determining that the timestamp is earlier than a timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence, modifying the timestamp to indicate a timestamp that is no earlier than the timestamp represented by an immediately preceding contiguous subsequence of timing tokens in the output sequence.
12. The method of any preceding claim, wherein the generative neural network comprises an auto-regressive neural network that auto-regressively generates tokens from the vocabulary.
13. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-12.
14. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-12.
15. A method of training a generative neural network performed by a set of one or more computers, the method comprising: obtaining a training example that comprises (i) a training input and (ii) a target output that comprises an audio sequence; identifying a partitioning of the audio sequence into a plurality of audio segments; for each audio segment, identifying a text transcription of the audio segment; generating a ground truth output sequence that comprises, for each audio segment, (i) a respective set of audio tokens representing the audio segment and (ii) a respective set of text tokens representing the text transcription of the audio segment, wherein the respective set of text tokens precedes the respective set of audio tokens in the ground truth output sequence; and training the generative neural network using the training input and the ground truth output sequence.
16. The method of claim 15, wherein training the generative neural network using the training input and the ground truth output sequence comprises: training the generative neural network on a next token prediction objective.
17. The method of claim 15 or claim 16, wherein each respective set of audio tokens is immediately preceded by a start of audio token in the ground truth output sequence.
18. The method of claim 17, wherein each respective set of audio tokens is immediately followed by an end of audio token in the ground truth output sequence.
19. The method of any of claims 15 to 18, wherein one or more of the plurality of audio segments each correspond to speech of a respective word and wherein the text transcription of the audio segment is a transcription of the respective word.
20. The method of any of claims 15 to 19, wherein one or more of the plurality of audio segments each correspond to speech of a respective chunk of multiple words and wherein the text transcription of the audio segment is a transcription of the multiple words.
21. The method of any of claims 15 to 20, further comprising: generating the respective sets of audio tokens representing the audio segments by applying an audio tokenizer to the audio sequence.
22. The method of any of claims 15 to 21, further comprising: generating the respective set of text tokens by applying a text tokenizer to the transcription of the audio segment.
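The construction of the interleaved ground truth output sequence in claims 15 to 22 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the marker strings `<soa>` and `<eoa>`, the function `build_ground_truth`, and both toy tokenizers are hypothetical stand-ins for a real audio tokenizer and text tokenizer. The `text_first` flag switches between the text-before-audio ordering of claim 15 and the audio-before-text ordering of claim 23.

```python
from typing import Callable, List, Sequence, Tuple

SOA = "<soa>"  # hypothetical start-of-audio marker (claim 17)
EOA = "<eoa>"  # hypothetical end-of-audio marker (claim 18)


def build_ground_truth(
    segments: Sequence[Tuple[str, Sequence[int]]],
    text_tokenizer: Callable[[str], List[str]],
    audio_tokenizer: Callable[[Sequence[int]], List[str]],
    text_first: bool = True,
) -> List[str]:
    """Interleave per-segment text tokens and audio tokens into one sequence."""
    sequence: List[str] = []
    for transcript, samples in segments:
        text_tokens = text_tokenizer(transcript)
        # Each audio span is wrapped in start-of-audio and end-of-audio tokens.
        audio_tokens = [SOA, *audio_tokenizer(samples), EOA]
        if text_first:
            sequence += text_tokens + audio_tokens  # ordering of claim 15
        else:
            sequence += audio_tokens + text_tokens  # ordering of claim 23
    return sequence


def toy_text_tokenizer(text: str) -> List[str]:
    """Toy stand-in: one token per whitespace-separated word."""
    return text.split()


def toy_audio_tokenizer(samples: Sequence[int]) -> List[str]:
    """Toy stand-in: one token per integer sample."""
    return [f"a{s}" for s in samples]


segments = [("hello", [1, 2]), ("big world", [3])]
ground_truth = build_ground_truth(segments, toy_text_tokenizer, toy_audio_tokenizer)

# Next token prediction (claim 16): inputs are the sequence shifted by one
# position relative to the targets.
inputs, targets = ground_truth[:-1], ground_truth[1:]
```

Here the first segment is a single word (claim 19) and the second is a chunk of multiple words (claim 20).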
23. A method of training a generative neural network performed by a set of one or more computers, the method comprising: obtaining a training example that comprises (i) a training input and (ii) a target text output; identifying a partitioning of the target text output into a plurality of text segments; for each text segment, identifying an audio segment that represents a verbalization of the text segment; generating a ground truth output sequence that comprises, for each text segment, (i) a respective set of text tokens representing the text segment and (ii) a respective set of audio tokens representing the corresponding audio segment, wherein the respective set of audio tokens precedes the respective set of text tokens in the ground truth output sequence; and training the generative neural network using the training input and the ground truth output sequence.
24. The method of claim 23, wherein training the generative neural network using the training input and the ground truth output sequence comprises: training the generative neural network on a next token prediction objective.
25. The method of claim 23 or claim 24, wherein each respective set of audio tokens is immediately preceded by a start of audio token in the ground truth output sequence.
26. The method of claim 25, wherein each respective set of audio tokens is immediately followed by an end of audio token in the ground truth output sequence.
27. The method of any of claims 23 to 26, wherein one or more of the plurality of audio segments corresponds to speech of a respective word.
28. The method of any of claims 23 to 27, wherein one or more of the plurality of audio segments correspond to speech of a respective chunk of multiple words and wherein the text transcription of the audio segment is a transcription of the multiple words.
29. The method of any of claims 23 to 28, further comprising: generating the respective sets of audio tokens representing the audio segments by applying an audio tokenizer to the audio sequence.
30. The method of any of claims 23 to 29, further comprising: generating the respective set of text tokens by applying a text tokenizer to the transcription of the audio segment.
31. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 15-30.
32. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-30.
33. A method performed by one or more computers, the method comprising: initializing a communication session between a dialogue system and a user system; and at each of a plurality of time steps during the communication session: obtaining one or more audio tokens representing audio received from the user system at the time step; processing an input set of tokens that comprises (i) the one or more audio tokens and (ii) an output audio token generated by the dialogue system at the preceding time step in the communication session to generate one or more input tokens to a generative neural network; and processing the one or more input tokens using the generative neural network to generate one or more output audio tokens for the time step.
34. The method of claim 33, wherein the generative neural network is an auto-regressive generative neural network that auto-regressively generates output tokens.
35. The method of claim 33 or claim 34, further comprising: processing the one or more output audio tokens to generate audio for the time step; and providing the audio for the time step for playback at the user system.
36. The method of any one of claims 33-35, wherein the input tokens to the neural network are vectors in an embedding space, and wherein processing the input set of tokens comprises: generating a respective embedding of each token in the input set of tokens; and processing an input comprising the respective embeddings of the tokens in the input set of tokens using an input-output encoder neural network to generate, as output, an input token that is in the embedding space.
37. The method of claim 36, wherein the input comprising the respective embeddings of the tokens in the input set of tokens further comprises the respective input tokens at one or more preceding time steps during the communication session.
38. The method of claim 36 or claim 37, wherein the input set of tokens further comprises audio tokens representing audio received from the user system at one or more preceding time steps during the communication session.
39. The method of any of claims 33 to 38, further comprising: at each of a plurality of time steps during the communication session: obtaining one or more video tokens representing video received from the user system at the time step, wherein the input set of tokens comprises (i) the one or more audio tokens and (ii) the output audio token generated by the dialogue system at the preceding time step, and (iii) the one or more video tokens.
40. The method of any of claims 33 to 39, wherein processing the input set of tokens comprises generating a sequence of the input set of tokens that interleaves the output audio token and the one or more audio tokens.
41. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 33-40.
42. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 33-40.
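The per-time-step loop of claim 33 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the claimed system: tokens are plain integers, `encode_inputs` and `generate` are hypothetical callables standing in for the input processing and the generative neural network, the last output audio token of a step is the one fed back at the next step, and no fed-back token exists at the first step.

```python
from typing import Callable, Iterable, List, Optional


def run_session(
    user_audio_steps: Iterable[List[int]],
    encode_inputs: Callable[[List[int]], List[int]],
    generate: Callable[[List[int]], List[int]],
) -> List[List[int]]:
    """Run one communication session and return the output tokens of each step."""
    outputs: List[List[int]] = []
    previous_output: Optional[int] = None  # assumption: nothing to feed back at step 0
    for user_tokens in user_audio_steps:
        # Input set: (i) user audio tokens for this step and (ii) the output
        # audio token generated at the preceding step, placed first here.
        input_set = list(user_tokens)
        if previous_output is not None:
            input_set = [previous_output] + input_set
        model_inputs = encode_inputs(input_set)
        step_outputs = generate(model_inputs)
        outputs.append(step_outputs)
        if step_outputs:
            previous_output = step_outputs[-1]
    return outputs


def toy_generate(tokens: List[int]) -> List[int]:
    """Toy stand-in for the generative neural network: emits one token per step."""
    return [sum(tokens) % 100]


session_outputs = run_session([[1, 2], [3], [4, 5]], lambda tokens: tokens, toy_generate)
```

In a real system the identity `encode_inputs` would be replaced by embedding each token and applying an encoder as in claim 36, and each step's output audio tokens would be decoded to audio for playback as in claim 35.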