Hypothesis concatenator for speech recognition of long-form audio
By segmenting long-format audio into fragments and using a hypothesis splicer to merge and summarize short-fragment hypotheses, the problems of accuracy and cost in long-format speech recognition systems are solved, achieving more efficient speech recognition results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MICROSOFT TECHNOLOGY LICENSING LLC
- Filing Date
- 2021-11-23
- Publication Date
- 2026-06-23
AI Technical Summary
Existing end-to-end automatic speech recognition systems perform poorly when processing long-format speech, especially speech lasting 10 minutes or longer, resulting in decreased accuracy and increased computational costs.
The audio stream is segmented into multiple audio segments, the speaker in each segment is identified, automatic speech recognition is performed to generate short segment hypotheses, and these hypotheses are merged and summarized by a hypothesis splicer. A network-based splicer is used for splicing, including window-changing symbols and speaker feature identification, to reduce computational costs.
It improves the accuracy of speech recognition for long-format audio and reduces computational costs, significantly reduces the word error rate in multi-speaker audio, and improves the speed and efficiency of speech recognition.
Figure CN116648744B_ABST
Abstract
Description
Background Technology
[0001] End-to-end (E2E) automatic speech recognition (ASR) systems use a single neural network (NN) to convert audio into word sequences, making them generally simpler than earlier ASR systems. E2E ASR solutions typically ingest short audio segments to process complete utterances before generating hypotheses. Unfortunately, models trained on short utterances often perform poorly when applied to speech exceeding the length of the training data. This can occur with long-format speech (e.g., speech lasting 10 minutes or more), a situation that may arise when transcribing streaming audio and in other ASR tasks.
Summary of the Invention
[0002] The disclosed examples are described in detail below with reference to the accompanying drawings. The following summary is provided to illustrate some of the examples disclosed herein. However, this is not intended to limit all examples to any particular configuration or sequence of operations.
[0003] A hypothesis splicer for speech recognition of long-format audio offers superior performance, such as higher accuracy and lower computational cost. Disclosed operational examples include: segmenting an audio stream into multiple audio segments; identifying multiple speakers within each of the multiple audio segments; performing automatic speech recognition (ASR) on each of the multiple audio segments to generate multiple short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting splicing symbols, including window change (WC) symbols, into the first merged hypothesis set; and summarizing the first merged hypothesis set into a first summarized hypothesis using a network-based hypothesis splicer. Various variations are disclosed, including alignment-based splicers and serial splicers, which can operate as speaker-specific splicers or multi-speaker splicers, and can also support multiple options for different hypothesis configurations.
Description of the Drawings
[0004] The disclosed example is described in detail below with reference to the accompanying drawings:
[0005] Figure 1 shows an arrangement for speech recognition that advantageously employs a hypothesis splicer for speech recognition of long-format audio;
[0006] Figures 2A and 2B show examples of window overlap that can be used with the arrangement of Figure 1;
[0007] Figure 3 shows more details of the example hypothesis splicer of Figure 1;
[0008] Figure 4 shows example hypothesis sets for an alignment-based splicer that can be used in the arrangement of Figure 1;
[0009] Figure 5 shows example merged hypothesis sets for a serial splicer that can be used in the arrangement of Figure 1;
[0010] Figures 6A, 6B, 6C, and 6D show example merged hypothesis sets for multi-speaker serial splicer variants that can be used in the arrangement of Figure 1;
[0011] Figure 7 is a flowchart illustrating exemplary operations associated with the arrangement of Figure 1;
[0012] Figure 8 is another flowchart illustrating exemplary operations associated with the arrangement of Figure 1; and
[0013] Figure 9 is a block diagram of an example computing environment suitable for implementing some of the various examples disclosed herein.
[0014] Throughout the accompanying drawings, corresponding reference numerals indicate corresponding parts.
Detailed Description
[0015] Various examples will be described in detail with reference to the accompanying drawings. Where possible, the same reference numerals will be used throughout the drawings to denote the same or similar parts. References to specific examples and implementations throughout this disclosure are for illustrative purposes only and are not intended to limit all examples unless otherwise stated.
[0016] A hypothesis splicer for speech recognition of long-format audio offers superior performance, such as higher accuracy and lower computational cost. Disclosed operational examples include: segmenting an audio stream into multiple audio segments; identifying multiple speakers within each of the multiple audio segments; performing automatic speech recognition (ASR) on each of the multiple audio segments to generate multiple short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting splicing symbols, including window change (WC) symbols, into the first merged hypothesis set; and summarizing the first merged hypothesis set into a first summarized hypothesis using a network-based hypothesis splicer. Various variations are disclosed, including alignment-based splicers and serial splicers, which can operate as speaker-specific splicers or multi-speaker splicers, and can also support multiple options for different hypothesis configurations.
[0017] Various aspects of this invention improve the speed and accuracy of speech recognition by merging short hypotheses into a merged hypothesis set and then summarizing the merged hypothesis set into a summarized hypothesis using a network-based hypothesis splicer. The network-based hypothesis splicer offers superior accuracy. Some examples employ a serial splicer that does not require alignment of odd and even hypothesis sequences (e.g., word alignment), thereby reducing the required overlap and thus lowering computational costs.
[0018] The hypothesis splicer ingests multiple hypotheses from short audio segments and outputs a single fused hypothesis, significantly improving the speaker-attributed word error rate (SA-WER) for long-format multi-speaker audio. As used herein, a hypothesis is estimated content of the audio and may include a sequence of estimated words or of tokens representing words. A hypothesis may also include other estimated content, such as speaker identifiers, language identifiers, and other speaker representations, as well as other labels or symbols. Several variations of the model architecture are disclosed, including some that reduce computational cost due to relaxed overlap requirements.
[0019] The example uses sliding windows with overlap between segments to segment long audio, with end-to-end (E2E) ASR applied to each window to generate a hypothesis. A sequence-to-sequence model, trained to fuse multiple hypotheses from the overlapping windows, merges the hypotheses from each window into a single hypothesis. Hypothesis merging can be performed with significantly high accuracy using a machine learning (ML) module.
[0020] Figure 1 shows an arrangement 100 for speech recognition that advantageously employs a hypothesis splicer for speech recognition of long-format audio. Microphone 104 receives (captures) an audio stream 102 from multiple speakers 106, including speakers 106a, 106b, and 106c. Audio segmenter 108 receives the audio stream 102 and segments it into multiple audio segments 110. As shown, the multiple audio segments 110 include audio segments 111, 112, 113, 114, and 115, although it should be understood that different numbers of audio segments can be used.
[0021] Turning briefly to Figures 2A and 2B, the overlap of audio segments is now described. Figure 2A shows 50% window overlap in overlap scenario 200a, where the odd-numbered audio segments stop and start back-to-back (e.g., without gaps), and the even-numbered audio segments similarly stop and start back-to-back (e.g., without gaps). Changes between odd-numbered windows occur within even-numbered windows, and changes between even-numbered windows similarly occur within odd-numbered windows, resulting in parallel, interleaved sequences. Audio stream 102 is shown twice, once annotated with the odd-numbered audio segments (e.g., audio segments 111, 113, and 115) and once with the even-numbered audio segments (e.g., audio segments 112 and 114). The window length 211 of audio segment 111 is the same as that of the other audio segments 112-115, e.g., matching the window length 212 of audio segment 112. The overlap duration 202 is half (50%) of each of window length 211 and window length 212. The five audio segments 111-115 together cover time period 204. In some examples, the window length 211 is 16 seconds or 30 seconds, or some other duration of less than one minute, while the length of the audio stream can exceed 10 minutes or even one hour.
[0022] Figure 2B shows 25% window overlap in overlap scenario 200b. Changes between odd-numbered windows occur within even-numbered windows, and vice versa, resulting in parallel, staggered sequences, but with gaps allowed within each of the odd and even sequences. Audio stream 102 is shown twice, once annotated with the odd-numbered audio segments (e.g., audio segments 111, 113, and 115) and once with the even-numbered audio segments (e.g., audio segments 112 and 114). The window length 211 of audio segment 111 is the same as that of the other audio segments 112-115, e.g., matching the window length 212 of audio segment 112. The overlap duration 206 is one-quarter (25%) of each of window length 211 and window length 212. The five audio segments 111-115 together cover time period 208, which is longer than time period 204 of overlap scenario 200a (with 50% window overlap).
[0023] Therefore, for the same length of time as time period 204, fewer audio segments are processed with 25% overlap than with 50% overlap, thus reducing computational cost. The overlap of the hypotheses follows the overlap of the audio segments. It should be understood that different amounts of overlap can be used in various aspects of this disclosure, including overlap as low as 10% or less.
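The cost argument above can be illustrated with a small calculation. The following is a minimal sketch, not the disclosed segmenter: the function name `segment_windows`, the 600-second stream, and the 16-second window are illustrative assumptions, and the overlap is taken to be the fraction of a window shared with the next window.

```python
import math

def segment_windows(total_s, window_s, overlap):
    """Return (start, end) times in seconds for fixed-length windows.

    overlap is the fraction of a window shared with the next window
    (0.5 for the 50% scenario, 0.25 for the 25% scenario).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = window_s * (1.0 - overlap)  # distance between consecutive window starts
    if total_s <= window_s:
        return [(0.0, float(total_s))]
    count = math.ceil((total_s - window_s) / hop) + 1
    # The last window is clipped to the end of the stream.
    return [(i * hop, min(i * hop + window_s, total_s)) for i in range(count)]

half = segment_windows(600, 16, 0.50)     # 10-minute stream, 16 s windows
quarter = segment_windows(600, 16, 0.25)
print(len(half), len(quarter))            # 74 50
```

Under these assumptions, the 25% setting needs 50 ASR passes instead of 74 for the same ten minutes of audio.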
[0024] Returning to Figure 1, the multiple audio segments 110 are provided to speaker identifier 120, which uses speaker profiles 122 (e.g., information about speaker characteristics of speakers 106a-106c) to associate each speaker 106a-106c with utterances within the multiple audio segments 110. The multiple audio segments 110 are also provided to an E2E speech recognition engine (SRE) 130 to perform speech recognition and output hypotheses. In some examples, a different form of SRE may be used instead of E2E. The E2E SRE 130 is trained by an E2E trainer 132 using speech recognition training data 134. The hypotheses output by the E2E SRE 130 are grouped into speaker-specific short-segment hypotheses 138 based on speaker grouping 136. ASR is performed on each audio segment 111-115 to produce K hypotheses, where K is the number of speakers 106a-106c (and, in this example, also the number of profiles in speaker profiles 122).
[0025] Short-segment hypotheses 138 are illustrated using a set of 18 example hypotheses, with 6 hypotheses for each of speakers 106a-106c. S1H1, S1H2, S1H3, S1H4, S1H5, and S1H6 are the 6 hypotheses for speaker 106a, in chronological order. S2H1, S2H2, S2H3, S2H4, S2H5, and S2H6 are the 6 hypotheses for speaker 106b, in chronological order. S3H1, S3H2, S3H3, S3H4, S3H5, and S3H6 are the 6 hypotheses for speaker 106c, in chronological order. It should be understood that the three speakers, each yielding 6 hypotheses, are merely examples.
[0026] Connector 140 merges at least a portion (e.g., at least some) of short-segment hypotheses 138 into merged hypotheses 150, which include at least a merged hypothesis set 150a. In general, the hypotheses of short-segment hypotheses 138 can be merged into a set of the form shown in Formula 1:
[0027] $\hat{\mathcal{Y}}^{k} = \{\hat{Y}^{1,k}, \hat{Y}^{2,k}, \ldots, \hat{Y}^{M,k}\}$ (Formula 1)
[0028] where $\hat{Y}^{m,k}$ represents the hypothesis for speaker k in audio segment m, k ranges from 1 to the number of speakers K, and m ranges from 1 to the number of audio segments M (e.g., audio stream 102 is divided into M segments, with overlap). $\hat{\mathcal{Y}}^{k}$ is the merged hypothesis set for speaker k (e.g., merged hypothesis set 150a). In some examples, if speaker k is not detected in audio segment m, the hypothesis $\hat{Y}^{m,k}$ is set to an empty sequence.
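The per-speaker grouping described above, including the empty sequence for a speaker who is not detected in a segment, can be sketched as follows. The data layout (one dictionary per audio segment, keyed by speaker index) and the function name `group_by_speaker` are illustrative assumptions rather than part of the disclosure.

```python
def group_by_speaker(segment_hyps, num_speakers):
    """Build one merged hypothesis set per speaker.

    segment_hyps: list over audio segments m = 1..M; each entry maps a
                  speaker index k to that speaker's token list in segment m.
    Returns a dict k -> list of M token lists. A speaker who was not
    detected in a segment gets an empty sequence for that segment.
    """
    return {
        k: [hyps.get(k, []) for hyps in segment_hyps]
        for k in range(1, num_speakers + 1)
    }

# Three overlapping segments, two speakers (toy data).
segment_hyps = [
    {1: ["hello", "there"], 2: ["hi"]},
    {1: ["there", "friend"]},          # speaker 2 not detected here
    {2: ["bye"]},                      # speaker 1 not detected here
]
merged = group_by_speaker(segment_hyps, num_speakers=2)
print(merged[2])  # [['hi'], [], ['bye']]
```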
[0029] Several options for the operation of connector 140 and the format of merged hypotheses 150 are disclosed in the following figures. Variations include whether merged hypotheses 150 include speaker-specific merged hypothesis sets (e.g., each of merged hypothesis set 150a, merged hypothesis set 150b, and merged hypothesis set 150c is for only a single one of speakers 106a-106c), or whether merged hypotheses 150 include a multi-speaker version of merged hypothesis set 150a.
[0030] Symbol inserter 142 inserts splicing symbols 144 into merged hypotheses 150, for example, into merged hypothesis set 150a and merged hypothesis sets 150b and 150c (if used). Splicing symbols 144 include a general window change (WC) symbol 144a and, in some examples, an even window change (WCE) symbol 144b (indicating a change from an odd window to an even window) and an odd window change (WCO) symbol 144c (indicating a change from an even window to an odd window). A window change marks the point where the word or token sequence identified from one audio segment ends and the word or token sequence identified from the next audio segment begins. In some examples, splicing symbols 144 may also include: a speaker identifier (SPKR_k, where k is the speaker's index number), speaker characteristics (e.g., language (LANG_k), speaker age (AGE_k), and accent (ACCENT_k)), and hypothesis ranking. Several options for using splicing symbols 144 are disclosed, as shown in Figures 4-6D.
[0031] Hypothesis splicer 300 summarizes merged hypotheses 150 into summarized hypotheses 160. In some examples, this is achieved by summarizing speaker-specific merged hypothesis sets 150a-150c into speaker-specific summarized hypotheses (e.g., summarized hypothesis 160a, summarized hypothesis 160b, and summarized hypothesis 160c), while in other examples, merged hypothesis set 150a is a multi-speaker merged hypothesis set that is summarized into a multi-speaker version of summarized hypothesis 160a. Hypothesis splicer 300 can be network-based and, in some examples, can include a neural network (NN). Various configurations are disclosed, such as alignment-based splicers and serial splicers that do not use alignment of odd-numbered hypothesis sequences with even-numbered hypothesis sequences, thus allowing for relaxed overlap requirements. More details regarding hypothesis splicer 300, splicer trainer 310, and hypothesis splicer training data 312 are provided with reference to Figure 3.
[0032] Summarized hypothesis 160a, output as transcript 170, can be used for various tasks where ASR results are useful, including live transcription of conversations (e.g., video calls or voice calls) or streaming video, as well as voice commands. In some examples, summarized hypothesis 160a is a multi-speaker summarized hypothesis and covers multiple speakers (e.g., speakers 106a-106c). In some examples, each of summarized hypotheses 160a-160c is a speaker-specific summarized hypothesis, and transcript 170 will be a speaker-specific transcript unless summarized hypotheses 160a-160c are merged into a multi-speaker version of transcript 170 by transcript merger and annotator 162. In some examples, transcript merger and annotator 162 ingests each of summarized hypotheses 160a-160c and outputs the multi-speaker version of transcript 170. In some examples, transcript merger and annotator 162 annotates transcript 170 with timestamps obtained from timer 164, which can also be used to define the time windows used by audio segmenter 108 (so that the timestamps are properly synchronized with audio stream 102).
[0033] Figure 3 shows further details of hypothesis splicer 300. Hypothesis splicer 300 includes an encoder 304, a decoder 306, and an embedding function 302. Embedding function 302 ingests a symbol (e.g., a word, token, or splicing symbol) and outputs a vector corresponding to that symbol (called an embedding). In some examples, hypothesis splicer 300 includes a transformer-based attention encoder-decoder architecture. In some examples, hypothesis splicer 300 includes a sequence-to-sequence model (e.g., a trained neural network) that merges multiple hypotheses from short segments of audio into a single hypothesis. In some examples, hypothesis splicer 300 is trained using dialogue that has been segmented according to the overlap that will be used during operation, marked with splicing symbols 144 (so that hypothesis splicer 300 learns splicing symbols 144), and labeled for training. That is, hypothesis splicer training data 312 includes splicing symbols 144 and has overlap similar to that in merged hypothesis set 150a.
[0034] In some examples, errors 314 are inserted into hypothesis splicer training data 312, causing hypothesis splicer 300 to learn to correct errors such as incorrect words 316 (e.g., words misidentified in a hypothesis) and incorrect speakers 318 (e.g., incorrectly identified speakers). Variations such as alignment-based splicers and serial splicers are described in more detail with reference to Figures 4 and 5.
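One way to picture the insertion of word errors into training data is random substitution. The sketch below is an assumption for illustration only: the text does not state how errors 314 are generated, and the function name, substitution rate, and toy vocabulary are hypothetical. Tokens written in angle brackets are treated as splicing symbols and left untouched.

```python
import random

def inject_word_errors(tokens, vocabulary, rate, rng):
    """Return a copy of tokens with some words replaced by wrong words.

    Each ordinary word is replaced, with probability `rate`, by a different
    word drawn from `vocabulary`. Tokens starting with "<" are assumed to be
    splicing symbols and are never altered.
    """
    noisy = []
    for tok in tokens:
        if not tok.startswith("<") and rng.random() < rate:
            choices = [w for w in vocabulary if w != tok]
            noisy.append(rng.choice(choices) if choices else tok)
        else:
            noisy.append(tok)
    return noisy

rng = random.Random(0)  # fixed seed so the corruption is repeatable
clean = ["hello", "there", "<wc>", "there", "friend"]
vocab = ["hello", "there", "friend", "fiend", "their"]
noisy = inject_word_errors(clean, vocab, rate=0.3, rng=rng)
print(noisy)
```

Incorrect speaker labels could be simulated in the same way by swapping speaker symbols instead of words.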
[0035] Figure 4 shows example hypothesis sequences, sequence 400o (o stands for "odd") and sequence 400e (e stands for "even"), which are combined into merged hypothesis set 150a for an alignment-based splicer version of hypothesis splicer 300. In the scenario shown, 50% overlap is used, where the odd-numbered audio segments stop and start back-to-back and overlap with the even-numbered audio segments, as shown in Figure 2A. Hypotheses from the odd-numbered (e.g., hypothesis 401) and even-numbered (e.g., hypothesis 402) audio segments are concatenated into two sequences, 400o and 400e, as shown in Figure 4.
[0036] Sequence 400o contains, for speaker k, odd-numbered hypothesis 401 (1 is odd) and the other odd-numbered hypotheses. Sequence 400e contains, also for speaker k, even-numbered hypothesis 402 (2 is even) and the other even-numbered hypotheses. WC symbols 144a are inserted between the hypotheses to indicate the window change corresponding to the transition from one audio segment to the next. The word alignment between sequences 400o and 400e is expressed as L word pairs $\langle o_1, e_1 \rangle, \langle o_2, e_2 \rangle, \ldots, \langle o_L, e_L \rangle$, where WC can appear as $o_l$ or $e_l$ in a pair $l$. Sequences 400o and 400e are then merged into merged hypothesis set 150a (or another merged hypothesis set), which is then summarized by hypothesis splicer 300.
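The pairing of an odd sequence with an even sequence can be sketched with a generic sequence aligner. The text does not say which alignment algorithm is used, so Python's standard `difflib.SequenceMatcher` is swapped in here purely for illustration; the use of `None` for an unmatched position and the `<wc>` spelling are likewise assumptions.

```python
from difflib import SequenceMatcher

def align_sequences(odd, even):
    """Pair tokens of the odd and even sequences into <o_l, e_l> tuples.

    Positions that have no counterpart in the other sequence are padded
    with None, so every token of both inputs appears exactly once.
    """
    pairs = []
    matcher = SequenceMatcher(None, odd, even, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        span = max(i2 - i1, j2 - j1)
        for step in range(span):
            o = odd[i1 + step] if i1 + step < i2 else None
            e = even[j1 + step] if j1 + step < j2 else None
            pairs.append((o, e))
    return pairs

# Window changes fall at different places, and one word was misrecognized.
odd = "how are <wc> you doing today".split()
even = "how <wc> are you going today".split()
for o, e in align_sequences(odd, even):
    print(o, e)
```

A splicer that consumes such pairs can compare the two candidates at each position, which is why this variant depends on both sequences covering the same audio.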
[0037] Figure 5 shows two examples of merged hypothesis set 150a for a serial splicer version of hypothesis splicer 300, shown alternatively as hypothesis set 510 or hypothesis set 520. In this case, the overlap may be less than 50% because word alignment is not used. In some examples, as shown for hypothesis set 510, the hypotheses for speaker k, both odd and even (such as odd hypothesis 501 and even hypothesis 502), are concatenated with WC symbols 144a inserted between them. Hypothesis set 520 is similar, except that it uses WC symbols that are designated as even or odd, such as WCE symbol 144b and WCO symbol 144c. Hypothesis set 510 or 520 is provided as merged hypothesis set 150a (or another merged hypothesis set) for summarization by hypothesis splicer 300.
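The two serial formats can be sketched as a simple concatenation. This is an illustrative sketch only: the function name and the `<wc>` spelling are assumptions (the `<wce>` and `<wco>` spellings follow the symbols mentioned for Figure 6A), and windows are assumed to be numbered from 1 in chronological order.

```python
def concat_with_window_symbols(hyps, parity_aware=False):
    """Concatenate per-window token lists, marking each window change.

    hyps: token lists in chronological window order (window 1 first).
    parity_aware=False inserts a generic <wc> at every change.
    parity_aware=True inserts <wce> when entering an even window and
    <wco> when entering an odd window.
    """
    out = []
    for index, tokens in enumerate(hyps, start=1):
        if index > 1:
            if not parity_aware:
                out.append("<wc>")
            elif index % 2 == 0:
                out.append("<wce>")  # odd window -> even window
            else:
                out.append("<wco>")  # even window -> odd window
        out.extend(tokens)
    return out

windows = [["a", "b"], ["b", "c"], ["c", "d"]]
print(concat_with_window_symbols(windows))
# ['a', 'b', '<wc>', 'b', 'c', '<wc>', 'c', 'd']
print(concat_with_window_symbols(windows, parity_aware=True))
# ['a', 'b', '<wce>', 'b', 'c', '<wco>', 'c', 'd']
```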
[0038] Figures 6A-6D show multi-speaker hypothesis sets, in which the hypotheses for all speakers are combined into a single sequence and the SPKR symbol is used as one of splicing symbols 144. In some examples, the SPKR symbol indicates the speaker index k, as in SPKR_k. To keep the relatively long sequences readable in Figures 6A-6D, the sequences use W1, W2, ... to indicate the words of the hypotheses in merged hypotheses 150, and SPKR_k is abbreviated as S1, S2, S3, S4, and S5 for k = 1, 2, 3, 4, and 5, respectively. That is, S1 is the abbreviation of the SPKR_k symbol for the first speaker (k = 1).
[0039] Figure 6A shows two variations for ordering SPKR symbols. Sequence 602 orders them according to the order in which each speaker first appears in audio stream 102 (and transcript 170). Sequence 604 shows another possible variation, ordering according to the order of the speaker profiles in speaker profiles 122. It should be noted that, because the windows (e.g., audio segments 111-115) overlap, some words preceding a WC, WCE, or WCO symbol correspond to words in the subsequent window. For example, W11 and W12 can correspond to W9 and W10. The output of hypothesis splicer 300 is shown in sequence 606, where W1', W2', ... indicate the words selected by hypothesis splicer 300 for summarized hypothesis 160a (or another summarized hypothesis). In some examples, speaker symbols are assigned together with other splicing symbols (e.g., <wco> and <wce>) and can take the form of a special identifier such as <sn>, where n indicates the speaker number (according to the index in speaker profiles 122).
[0040] Figure 6B shows a variation that uses SPKR symbols as splicing symbols in the output of hypothesis splicer 300 (e.g., summarized hypothesis 160a or another summarized hypothesis). Sequence 612 shows the same sequence of words (or tokens) as sequence 602 of Figure 6A, but with each word annotated. A special SPKR symbol, <s0>, can be assigned to each splicing symbol. The output of hypothesis splicer 300 may alternatively be in the format of sequence 614, where each word (or token) is annotated with an SPKR symbol, or in the format of sequence 616, where the SPKR symbol is used at the end of a word sequence attributed to a specific speaker (when subsequent words are attributed to a different speaker). In the example shown, sequence 606 is similar to sequence 616.
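The relationship between the per-word format (as described for sequence 614) and the run-end format (as described for sequence 616) can be sketched as a run-length grouping. The exact token layout in the figures is not reproduced here, so the input shape (a list of word and speaker-index pairs) and the function name are illustrative assumptions; the `<sn>` spelling follows the identifier form mentioned for Figure 6A.

```python
from itertools import groupby

def speaker_runs(annotated):
    """Collapse per-word speaker labels into one symbol per speaker run.

    annotated: list of (word, speaker_index) pairs in transcript order.
    Returns the words with a single <sN> symbol closing each run of
    consecutive words attributed to the same speaker.
    """
    out = []
    for speaker, run in groupby(annotated, key=lambda pair: pair[1]):
        out.extend(word for word, _ in run)
        out.append(f"<s{speaker}>")
    return out

per_word = [("W1", 1), ("W2", 1), ("W3", 2), ("W4", 1)]
print(speaker_runs(per_word))
# ['W1', 'W2', '<s1>', 'W3', '<s2>', 'W4', '<s1>']
```

The run-end form is shorter when speakers hold the floor for many words, at the cost of the label no longer sitting next to every word.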
[0041] Figure 6C shows a variation that uses LANG symbols as splicing symbols in the output of hypothesis splicer 300 (e.g., summarized hypothesis 160a or another summarized hypothesis). Sequence 622 is similar to sequence 612 of Figure 6B, but with a language annotation added after each word. The output of hypothesis splicer 300 may alternatively be in the format of sequence 624, where each word (or token) is annotated with a LANG symbol, or in the format of sequence 626, where a LANG symbol is used at the end of a word sequence attributed to a specific speaker or at a detected language change. In the format of sequence 622, a special LANG symbol, <l0>, can be assigned to each splicing symbol. Other speaker attributes (speaker age, accent, etc.) can also be embedded. When age annotations are included with splicing symbols, the annotations can take the form of symbols that incorporate the actual numerical value estimated for the speaker.
[0042] Figure 6D illustrates the use of multiple ranked hypotheses, such as the N-best hypotheses estimated by E2E SRE 130. In some examples, E2E SRE 130 is capable of outputting more than a single hypothesis for each detected word, and outputs multiple ranked hypotheses (ranked according to their probability of being correct). Sequence 632 illustrates a single-speaker scenario in which, for a set of four words, the best guess (N1) is W1, W2, W3, and W4, the second-best guess (N2) is W1*, W2*, W3*, and W4*, and the third-best guess (N3) is W1**, W2**, W3**, and W4**. Alternative sequence 634 illustrates the use of a generic N-best symbol N* in place of specific rank values, where the ranking is inferred from the order. N1, N2, N3, and/or N* are used as splicing symbols in the input of hypothesis splicer 300. The output of hypothesis splicer 300 is shown as sequence 636, with the selected words W1', W2', W3', and W4'. The multi-speaker version uses sequence 638, where the speaker is annotated after each group of N-best guesses. The output of hypothesis splicer 300 is shown as sequence 640, with selected words W1', W2', W3', W4', W5', W6', W7', W8', W9', and W10'.
[0043] Figure 7 is a flowchart 700 illustrating example operations involved in performing speech recognition. In some examples, the operations described for flowchart 700 are performed by computing device 900 of Figure 9. Flowchart 700 begins with operation 702, which includes training E2E SRE 130. In some examples, a joint model of the SRE and the speaker identifier (e.g., an end-to-end speaker-attributed ASR model) is trained in operation 702, and the joint model is used in subsequent operations 712, 714, and 716. Operation 704 includes training hypothesis splicer 300 with hypothesis splicer training data 312, which includes splicing symbols. In some examples, hypothesis splicer training data 312 has overlap similar to that in merged hypothesis set 150a. Operation 706, included in operation 704, includes inserting errors 314 into hypothesis splicer training data 312. In some examples, the inserted errors 314 include incorrectly identified words (in overlapping areas of segments) or incorrect speaker identifiers. In some examples, hypothesis splicer 300 includes an encoder and a decoder. In some examples, hypothesis splicer 300 includes a neural network. In some examples, hypothesis splicer 300 includes a transformer-based attention encoder-decoder architecture. In some examples, hypothesis splicer 300 includes an alignment-based splicer. In some examples, hypothesis splicer 300 includes a serial splicer that does not use alignment of odd and even hypothesis sequences. In some examples, hypothesis splicer 300 uses 25% overlap or less.
[0044] Operation 708 includes receiving audio stream 102. Operation 710 includes segmenting audio stream 102 into multiple audio segments 110. In some examples, the duration of each audio segment is less than one minute. Operation 712 includes identifying multiple speakers 106 (e.g., identifying each of speakers 106a-106c) within each audio segment of the multiple audio segments 110 (e.g., within audio stream 102). Operation 714 includes determining speaker characteristics. In some examples, speaker characteristics are selected from a list including: language, speaker age, and accent. Operation 716 includes performing ASR on each audio segment of the multiple audio segments 110 (e.g., audio segments 111-115) to generate multiple short-segment hypotheses 138. In some examples, performing ASR includes performing E2E ASR. In some examples, operations 712-716 are performed as a single operation using a common joint model (e.g., the end-to-end speaker-attributed ASR model described above). In some examples, short-segment hypotheses 138 include tokens representing words.
[0045] In some examples, short-segment hypotheses 138 are speaker-specific, so operations 718-728 are performed for each speaker using speaker-specific data sets (e.g., short-segment hypotheses 138, merged hypotheses 150, and summarized hypotheses 160). In some examples, short-segment hypotheses 138 are multi-speaker, so subsequent operations 718-728 are performed using multi-speaker data sets (e.g., multi-speaker versions of short-segment hypotheses 138, merged hypothesis set 150a, and summarized hypothesis 160a).
[0046] Operation 718 includes merging at least a portion of short-segment hypothesis 138 into a merged hypothesis set 150a, and is performed using operations 720-724. In some examples, merging at least a portion of short-segment hypothesis 138 into merged hypothesis set 150a includes summarizing tokenized hypotheses. In some examples, merged hypothesis set 150a includes a multi-speaker merged hypothesis set. In some examples, merged hypothesis set 150a includes hypothesis ranking. In some examples, merged hypothesis set 150a is specific to the first speaker among the multiple speakers 106, therefore operation 718 also includes merging at least a portion of short-segment hypothesis 138 into a merged hypothesis set 150b specific to the second speaker among the multiple speakers 106. Operation 720 includes grouping individual ASR results by speaker.
[0047] Operation 722 includes inserting a splicing symbol 144 into the merge hypothesis set 150a, the splicing symbol 144 including WC symbols (e.g., WC symbol 144a, WCE symbol 144b, and / or WCO symbol 144c). In some examples, operation 722 also includes inserting a speaker characteristic label as a splicing symbol 144 into the merge hypothesis set 150a, at least based on determined speaker characteristics. In some examples, the splicing symbol 144 includes at least one symbol selected from a list including: WC symbols, WCE symbols, WCO symbols, SPKR symbols, SPKR_k symbols, and speaker characteristic symbols or values.
[0048] In speaker-specific scenarios, operation 722 also includes inserting splicing symbols into merged hypothesis set 150b. In some examples, splicing symbols 144 also include a speaker identifier (e.g., SPKR_k). In examples using an alignment-based splicer (see Figure 4), merged hypothesis set 150a includes an odd hypothesis sequence and an even hypothesis sequence, and operation 724 includes aligning the odd hypothesis sequence with the even hypothesis sequence. In some examples, aligning the odd hypothesis sequence with the even hypothesis sequence includes pairing words or tokens in the odd sequence with words or tokens in the even sequence.
[0049] Operation 726 includes using a (network-based) hypothesis splicer 300 to aggregate a merged hypothesis set 150a into a summarized hypothesis 160a. In some examples, summarized hypothesis 160a includes tokens representing words. In some examples, summarized hypothesis 160a is speaker-specific (e.g., one of speakers 106a-106c). In the case of speaker-specific summarized hypothesis 160a, operation 726 also includes using hypothesis splicer 300 to aggregate a merged hypothesis set 150b into a second speaker-specific summarized hypothesis 160b.
[0050] Operation 728 includes outputting summary hypothesis 160a as transcript 170. If summary hypothesis 160a is a speaker-specific summary hypothesis, operations 730-734 are used to generate a multi-speaker version of transcript 170. However, if operation 728 has already output a multi-speaker version of transcript 170, operations 730-734 may be unnecessary. Operation 730 includes merging summary hypothesis 160a and summary hypothesis 160b to generate a multi-speaker version of transcript 170. Operation 732 includes identifying speakers 106a-106c within the multi-speaker version of transcript 170. Operation 734 includes outputting a multi-speaker version of transcript 170 if operation 728 only outputs a speaker-specific version of transcript 170.
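Merging speaker-specific results into a multi-speaker transcript (operations 730-734) can be sketched as a time-ordered merge. The text does not specify the merge rule, so this sketch assumes that each speaker-specific result is already split into utterances carrying a start time in seconds; the function name, labels, and output format are hypothetical.

```python
import heapq

def merge_transcripts(per_speaker):
    """Interleave speaker-specific utterances into one transcript.

    per_speaker: dict mapping a speaker label to a list of
                 (start_seconds, text) tuples sorted by start time.
    Returns formatted lines ordered by start time, each tagged with
    its speaker label.
    """
    streams = (
        [(start, label, text) for start, text in utterances]
        for label, utterances in per_speaker.items()
    )
    # heapq.merge lazily merges inputs that are already sorted.
    return [
        f"[{start:.1f}s] {label}: {text}"
        for start, label, text in heapq.merge(*streams)
    ]

per_speaker = {
    "speaker 1": [(0.0, "hello"), (5.0, "fine thanks")],
    "speaker 2": [(2.5, "hi how are you")],
}
for line in merge_transcripts(per_speaker):
    print(line)
```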
[0051] Figure 8 is a flowchart 800 illustrating exemplary operations involved in performing speech recognition. In some examples, the operations described for flowchart 800 are performed by the computing device 900 of Figure 9. Flowchart 800 begins with operation 802, which includes segmenting an audio stream into multiple audio segments. Operation 804 includes identifying multiple speakers within the audio stream (e.g., within the multiple audio segments). Operation 806 includes performing ASR on each of the multiple audio segments to generate multiple short-segment hypotheses. Operation 808 includes merging at least a portion of the short-segment hypotheses into a first merged hypothesis set. Operation 810 includes inserting splicing symbols, including WC symbols, into the first merged hypothesis set. Operation 812 includes summarizing the first merged hypothesis set into a first summarized hypothesis using a network-based hypothesis splicer.
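The segmentation step of operation 802 can be sketched as computing window boundaries over the audio samples. The function name `segment_bounds`, the measurement of lengths in samples, and the default overlap of 0.25 (taken from the "25% or less" overlap example given elsewhere in this description) are assumptions for illustration and not a definitive implementation.

```python
from typing import List, Tuple


def segment_bounds(num_samples: int, window: int, overlap: float = 0.25) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of fixed-length windows with fractional overlap.

    The last window is truncated at the end of the stream.
    """
    if window <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("window must be positive and overlap must be in [0, 1)")
    hop = max(1, int(window * (1.0 - overlap)))  # step between window starts
    bounds: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the stream is fully covered
        start += hop
    return bounds


if __name__ == "__main__":
    # Ten samples, windows of four samples, 25% overlap between neighbours.
    print(segment_bounds(10, 4, 0.25))
```

Each returned window would then be passed to ASR separately (operation 806), and the overlapping regions are where neighbouring short-segment hypotheses repeat or disagree.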
[0052] Additional examples
[0053] Example methods for speech recognition include: segmenting an audio stream into multiple audio segments; identifying multiple speakers within the audio stream; performing ASR on each of the multiple audio segments to generate multiple short segment hypotheses; merging at least a portion of the short segment hypotheses into a first merged hypothesis set; inserting splicing symbols, including WC symbols, into the first merged hypothesis set; and summarizing the first merged hypothesis set into a first summarized hypothesis using a network-based hypothesis splicer.
[0054] An example system for speech recognition includes: a processor; and a computer-readable medium storing instructions operable, when executed by the processor, to: segment an audio stream into multiple audio segments; identify multiple speakers within the audio stream; perform ASR on each of the multiple audio segments to generate multiple short segment hypotheses; merge at least a portion of the short segment hypotheses into a first merged hypothesis set; insert splicing symbols, including WC symbols, into the first merged hypothesis set; and aggregate the first merged hypothesis set into a first aggregated hypothesis using a network-based hypothesis splicer.
[0055] One or more example computer storage devices have computer-executable instructions stored thereon that, when executed by a computer, cause the computer to perform operations including: segmenting an audio stream into multiple audio segments; identifying multiple speakers within the audio stream; performing ASR on each of the multiple audio segments to generate multiple short segment hypotheses; merging at least a portion of the short segment hypotheses into a first merged hypothesis set; inserting splicing symbols into the first merged hypothesis set, the splicing symbols including window change (WC) symbols; and summarizing the first merged hypothesis set into a first summarized hypothesis using a network-based hypothesis splicer.
[0056] Alternatively, or in addition to the other examples described herein, examples may include any combination of the following:
[0057] - outputting the first summarized hypothesis;
[0058] - the first merged hypothesis set is specific to a first speaker among the multiple speakers;
[0059] - the first summarized hypothesis is specific to the first speaker;
[0060] - merging at least a portion of the short-segment hypotheses into a second merged hypothesis set specific to a second speaker among the multiple speakers;
[0061] - inserting the splicing symbols into the second merged hypothesis set;
[0062] - summarizing the second merged hypothesis set into a second summarized hypothesis specific to the second speaker using the hypothesis splicer;
[0063] - the first merged hypothesis set includes a multi-speaker merged hypothesis set;
[0064] - the splicing symbols also include a speaker identifier;
[0065] - the hypothesis splicer includes an alignment-based splicer;
[0066] - the first merged hypothesis set includes an odd hypothesis sequence and an even hypothesis sequence;
[0067] - aligning the odd hypothesis sequence with the even hypothesis sequence;
[0068] - the hypothesis splicer includes a serial splicer that does not use alignment of odd and even hypothesis sequences;
[0069] - the hypothesis splicer uses an overlap of 25% or less;
[0070] - the first merged hypothesis set includes a hypothesis ranking;
[0071] - performing ASR includes performing E2E ASR;
[0072] - the short-segment hypotheses and the first summarized hypothesis include tokens representing words;
[0073] - aligning the odd hypothesis sequence with the even hypothesis sequence includes pairing words or tokens in the odd sequence with words or tokens in the even sequence;
[0074] - the hypothesis splicer includes an encoder and a decoder;
[0075] - the hypothesis splicer includes a neural network;
[0076] - the hypothesis splicer includes a transformer-based attention encoder-decoder architecture;
[0077] - determining a speaker characteristic;
[0078] - the speaker characteristic is selected from a list including: language, speaker age, and accent;
[0079] - based at least on the determined speaker characteristic, inserting a speaker characteristic label as a splicing symbol into the first merged hypothesis set;
[0080] - the splicing symbols also include at least one symbol selected from a list including: a WCE symbol, a WCO symbol, a SPKR symbol, a numbered speaker symbol, and a speaker characteristic symbol or value;
[0081] - training the hypothesis splicer using training data that includes splicing symbols;
[0082] - inserting errors into the training data of the hypothesis splicer;
[0083] - the inserted errors include incorrectly recognized words or incorrect speaker identifiers in overlapping segments;
[0084] - the training data of the hypothesis splicer has an overlap similar to the overlap in the first merged hypothesis set;
[0085] - merging at least a portion of the short-segment hypotheses into the first merged hypothesis set includes aggregating tokenized hypotheses;
[0086] - combining the first summarized hypothesis and the second summarized hypothesis to generate a multi-speaker transcript;
[0087] - identifying speakers within the multi-speaker transcript;
[0088] - outputting the transcript of the speaker;
[0089] - receiving the audio stream;
[0090] - the duration of each audio segment is less than one minute; and
[0091] - training the E2E ASR.
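The example of inserting errors into the training data of the hypothesis splicer can be sketched as a simple word-substitution routine. The function name `inject_errors`, the substitution strategy, and the convention that splicing symbols are written in angle brackets are all assumptions for illustration; the description states only that the inserted errors include incorrectly recognized words or incorrect speaker identifiers in overlapping segments. This sketch covers the word case only.

```python
import random
from typing import List, Sequence


def inject_errors(tokens: Sequence[str], vocab: Sequence[str], rate: float,
                  rng: random.Random) -> List[str]:
    """Replace each word token with a random vocabulary word with probability `rate`.

    Tokens written like "<WC>" are treated as splicing symbols and left untouched.
    """
    noisy: List[str] = []
    for token in tokens:
        is_symbol = token.startswith("<") and token.endswith(">")
        if not is_symbol and rng.random() < rate:
            noisy.append(rng.choice(vocab))  # simulate a misrecognized word
        else:
            noisy.append(token)
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the corruption is repeatable
    clean = ["hello", "how", "are", "<WC>", "are", "you", "today"]
    print(inject_errors(clean, ["hollow", "our", "yew"], 0.3, rng))
```

Passing a seeded `random.Random` keeps the corrupted training examples reproducible across runs.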
[0092] While aspects of this disclosure have been described with reference to various examples and their associated operations, those skilled in the art will understand that combinations of operations from any number of different examples are also within the scope of aspects of this disclosure.
[0093] Example operating environment
[0094] Figure 9 is a block diagram of an example computing device 900 for implementing aspects disclosed herein. Computing device 900 is merely one example of a suitable computing environment and is not intended to suggest any limitation on the scope or functionality of the examples disclosed herein. Nor should computing device 900 be construed as having any dependency or requirement relating to any one or combination of the components/modules shown. The examples disclosed herein can be described in the general context of computer code or machine-usable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine such as a personal digital assistant or other handheld device. Generally, program components, including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. The disclosed examples can be practiced in a variety of system configurations, including personal computers, laptops, smartphones, mobile tablets, handheld devices, consumer electronics, dedicated computing devices, etc. The disclosed examples can also be practiced in distributed computing environments, where tasks are performed by remote processing devices linked through a communication network.
[0095] Computing device 900 includes a bus 910 that directly or indirectly couples the following devices: computer storage memory 912, one or more processors 914, one or more presentation components 916, I/O ports 918, I/O components 920, power supply 922, and network component 924. Although computing device 900 is shown as a single device, multiple computing devices 900 can work together and share the illustrated device resources. In one example, memory 912 is distributed across multiple devices, and processor(s) 914 are housed in different devices.
[0096] Bus 910 represents what may be one or more buses (e.g., an address bus, a data bus, or a combination thereof). Although the various blocks of Figure 9 are shown with lines for the sake of clarity, alternative representations can be used to delineate the various components. For example, in some examples a presentation component such as a display device is an I/O component, and some examples of processors have their own memory. No distinction is made between categories such as "workstation," "server," "laptop," or "handheld device," as all are contemplated within the scope of Figure 9 and are referred to herein as a "computing device." Memory 912 may take the form of the computer storage media referenced below and is operable to provide storage of computer-readable instructions, data structures, program modules, and other data for computing device 900. In some examples, memory 912 stores one or more of an operating system, a general-purpose application platform, or other program modules and program data. Memory 912 is thus able to store and access data 912a and instructions 912b, which are executable by processor 914 and configured to carry out the various operations disclosed herein.
[0097] In some examples, memory 912 includes computer storage media in the form of volatile and/or non-volatile memory, removable or non-removable memory, data disks in virtual environments, or a combination thereof. Memory 912 may include any number of memories associated with or accessible by computing device 900. Memory 912 may be internal to computing device 900 (as shown in Figure 9), external to computing device 900 (not shown), or both (not shown). Examples of memory 912 include, but are not limited to, random access memory (RAM); read-only memory (ROM); electrically erasable programmable read-only memory (EEPROM); flash memory or other memory technologies; CD-ROM, digital versatile disc (DVD) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices; memory wired into an analog computing device; or any other medium for encoding desired information and for access by computing device 900. Additionally or alternatively, memory 912 may be distributed across multiple computing devices 900, for example, in a virtual environment in which instruction processing is carried out on multiple devices 900. For the purposes of this disclosure, "computer storage medium," "computer storage memory," "memory," and "storage device" are synonymous terms for computer storage memory 912, and none of these terms include carrier waves or propagating signaling.
[0098] Processor(s) 914 may include any number of processing units that read data from various entities such as memory 912 or I/O components 920. Specifically, processor(s) 914 are programmed to execute computer-executable instructions for implementing aspects of this disclosure. The instructions may be executed by a processor, by multiple processors within computing device 900, or by a processor external to client computing device 900. In some examples, processor(s) 914 are programmed to execute instructions such as those illustrated in the flowcharts discussed herein and depicted in the accompanying drawings. Furthermore, in some examples, processor(s) 914 represent an implementation of analog techniques to perform the operations described herein. For example, the operations may be performed by an analog client computing device 900 and/or a digital client computing device 900. Presentation component(s) 916 present data indications to a user or other device. Exemplary presentation components include display devices, speakers, printing components, vibrating components, etc. Those skilled in the art will understand and appreciate that computer data can be presented in a number of ways, such as visually in a graphical user interface (GUI), audibly through speakers, wirelessly between computing devices 900, across a wired connection, or in other ways. I/O ports 918 allow computing device 900 to be logically coupled to other devices including I/O components 920, some of which may be built in. Example I/O components 920 include, without limitation, a microphone, joystick, gamepad, satellite dish, scanner, printer, wireless device, etc.
[0099] Computing device 900 can operate in a networked environment via network component 924 using logical connections to one or more remote computers. In some examples, network component 924 includes a network interface card and/or computer-executable instructions (e.g., a driver) for operating the network interface card. Communication between computing device 900 and other devices can occur using any protocol or mechanism over any wired or wireless connection. In some examples, network component 924 is operable to communicate data over public, private, or hybrid (public and private) networks using a transfer protocol, to transfer data between devices wirelessly using short-range communication technologies (e.g., near-field communication (NFC), Bluetooth™ branded communications, or the like), or a combination thereof. Network component 924 communicates over wireless communication link 926 and/or wired communication link 926a with cloud resource 928 across network 930. Various examples of communication links 926 and 926a include a wireless connection, a wired connection, and/or a dedicated link, and in some examples, at least a portion is routed through the Internet.
[0100] Although described in conjunction with example computing device 900, the examples of this disclosure can be implemented using many other general-purpose or special-purpose computing system environments, configurations, or devices. Examples of known computing systems, environments, and/or configurations suitable for use with aspects of this disclosure include, but are not limited to, smartphones, mobile tablets, mobile computing devices, personal computers, server computers, handheld or laptop devices, multiprocessor systems, game consoles, microprocessor-based systems, set-top boxes, programmable consumer electronics, mobile phones, mobile computing and/or communication devices in wearable or accessory form factors (e.g., watches, glasses, headsets, or earphones), network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, virtual reality (VR) devices, augmented reality (AR) devices, mixed reality (MR) devices, holographic devices, etc. Such systems or devices can accept input from users in any manner, including from input devices such as keyboards or pointing devices, via gesture input, proximity input (such as by hovering), and/or via voice input.
[0101] Examples of this disclosure can be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices in software, firmware, hardware, or a combination thereof. Computer-executable instructions can be organized into one or more computer-executable components or modules. Typically, program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform a particular task or implement a particular abstract data type. Aspects of this disclosure can be implemented with any number and organization of such components or modules. For example, aspects of this disclosure are not limited to the specific computer-executable instructions or the specific components or modules shown in the figures and described herein. Other examples of this disclosure may include different computer-executable instructions or components having more or less functionality than shown and described herein. In examples involving a general-purpose computer, aspects of this disclosure transform the general-purpose computer into a special-purpose computing device when configured to execute the instructions described herein.
[0102] By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable memory implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, etc. Computer storage media are tangible and mutually exclusive to communication media. Computer storage media are implemented in hardware and exclude carrier waves and propagated signals. For the purposes of this disclosure, computer storage media are not signals per se. Exemplary computer storage media include hard disks, flash drives, solid-state memory, phase-change random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media typically embody computer-readable instructions, data structures, program modules, etc., in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.
[0103] The order of execution or performance of the operations in the examples of this disclosure shown and described herein is not essential, and the operations may be performed in different orders in various examples. For example, executing or performing a particular operation before, contemporaneously with, or after another operation is within the scope of aspects of this disclosure. When introducing elements of aspects of this disclosure or examples thereof, the articles "a," "an," "the," and "said" are intended to mean that there are one or more of the elements. The terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements. The term "exemplary" is intended to mean "an example of." The phrase "one or more of A, B, and C" means "at least one A and/or at least one B and/or at least one C."
[0104] Having described in detail the various aspects of this disclosure, it will be apparent that modifications and variations are possible without departing from the scope of the various aspects of this disclosure as defined in the appended claims. Since various changes can be made to the above-described structures, products, and methods without departing from the scope of the various aspects of this disclosure, it is intended that all content contained in the foregoing description and shown in the accompanying drawings be interpreted as illustrative rather than restrictive.
Claims
1. A speech recognition method, the method comprising: segmenting an audio stream into a plurality of audio segments; identifying a plurality of speakers within the audio stream; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging a first portion of the short-segment hypotheses into a first merged hypothesis set that is specific to a first speaker of the plurality of speakers; merging a second portion of the short-segment hypotheses into a second merged hypothesis set that is specific to a second speaker of the plurality of speakers, the second speaker being different from the first speaker; inserting splicing symbols into the first merged hypothesis set and the second merged hypothesis set, wherein the splicing symbols include a window change (WC) symbol; and using a network-based hypothesis splicer, summarizing the first merged hypothesis set into a first summarized hypothesis and summarizing the second merged hypothesis set into a second summarized hypothesis, wherein the hypothesis splicer includes an alignment-based splicer or a serial splicer, and the serial splicer does not use alignment of odd and even hypothesis sequences.
2. The method of claim 1, further comprising: outputting the first summarized hypothesis as a first transcript and the second summarized hypothesis as a second transcript.
3. The method of claim 1, wherein the first merged hypothesis set is specific to the first speaker of the plurality of speakers, and wherein the first summarized hypothesis is specific to the first speaker.
4. The method of claim 1, wherein the first merged hypothesis set includes a multi-speaker merged hypothesis set, and wherein the splicing symbols further include a speaker identifier.
5. The method of claim 1, wherein the hypothesis splicer includes the alignment-based splicer, wherein the first merged hypothesis set includes an odd hypothesis sequence and an even hypothesis sequence, and wherein the method further comprises: aligning the odd hypothesis sequence with the even hypothesis sequence.
6. The method of claim 1, wherein the hypothesis splicer uses an overlap of 25% or less.
7. A system for speech recognition, the system comprising: a processor; and a computer-readable medium storing instructions that are operable, when executed by the processor, to: segment an audio stream into a plurality of audio segments; identify a plurality of speakers within the audio stream; perform automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merge a first portion of the short-segment hypotheses into a first merged hypothesis set that is specific to a first speaker of the plurality of speakers; merge a second portion of the short-segment hypotheses into a second merged hypothesis set that is specific to a second speaker of the plurality of speakers, the second speaker being different from the first speaker; insert splicing symbols into the first merged hypothesis set and the second merged hypothesis set, wherein the splicing symbols include a window change (WC) symbol; and using a network-based hypothesis splicer, summarize the first merged hypothesis set into a first summarized hypothesis and summarize the second merged hypothesis set into a second summarized hypothesis, wherein the hypothesis splicer includes an alignment-based splicer or a serial splicer, and the serial splicer does not use alignment of odd and even hypothesis sequences.
8. The system of claim 7, wherein the first merged hypothesis set includes a hypothesis ranking.
9. The system of claim 7, wherein the first merged hypothesis set is specific to the first speaker of the plurality of speakers, and wherein the first summarized hypothesis is specific to the first speaker.
10. The system of claim 7, wherein the first merged hypothesis set includes a multi-speaker merged hypothesis set, and wherein the splicing symbols further include a speaker identifier.
11. The system of claim 7, wherein the hypothesis splicer includes the alignment-based splicer, wherein the first merged hypothesis set includes an odd hypothesis sequence and an even hypothesis sequence, and wherein the instructions are further operable to: align the odd hypothesis sequence with the even hypothesis sequence.
12. The system of claim 7, wherein the hypothesis splicer uses an overlap of 25% or less.
13. One or more computer storage devices storing computer-executable instructions that, when executed by a computer, cause the computer to perform operations comprising: segmenting an audio stream into a plurality of audio segments; identifying a plurality of speakers within the audio stream; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging a first portion of the short-segment hypotheses into a first merged hypothesis set that is specific to a first speaker of the plurality of speakers; merging a second portion of the short-segment hypotheses into a second merged hypothesis set that is specific to a second speaker of the plurality of speakers, the second speaker being different from the first speaker; inserting splicing symbols into the first merged hypothesis set and the second merged hypothesis set, wherein the splicing symbols include a window change (WC) symbol; and using a network-based hypothesis splicer, summarizing the first merged hypothesis set into a first summarized hypothesis and summarizing the second merged hypothesis set into a second summarized hypothesis, wherein the hypothesis splicer includes an alignment-based splicer or a serial splicer, and the serial splicer does not use alignment of odd and even hypothesis sequences.
14. The one or more computer storage devices of claim 13, wherein the operations further comprise: outputting the first summarized hypothesis as a first transcript and the second summarized hypothesis as a second transcript.
15. The one or more computer storage devices of claim 13, wherein the first merged hypothesis set is specific to the first speaker of the plurality of speakers, and wherein the first summarized hypothesis is specific to the first speaker.
16. The one or more computer storage devices of claim 13, wherein the first merged hypothesis set includes a multi-speaker merged hypothesis set, and wherein the splicing symbols further include a speaker identifier.
17. The one or more computer storage devices of claim 13, wherein the hypothesis splicer includes the alignment-based splicer, wherein the first merged hypothesis set includes an odd hypothesis sequence and an even hypothesis sequence, and wherein the operations further comprise: aligning the odd hypothesis sequence with the even hypothesis sequence.
18. The one or more computer storage devices of claim 13, wherein the hypothesis splicer uses an overlap of 25% or less.