Speech recognition method, device and storage medium
By introducing an attention layer into the text encoder of the speech recognition model, the contextual information of the sentence is integrated, which solves the accuracy problem of speech recognition models in multi-speaker scenarios and achieves higher speaker recognition accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2023-06-08
- Publication Date
- 2026-06-19
AI Technical Summary
Existing end-to-end speaker-related speech recognition models fail to achieve accurate speech recognition results in multiple speaker scenarios, mainly due to a lack of awareness of contextual information, resulting in inaccurate speaker identification for each word.
A text encoder with an attention layer is introduced into the speech recognition model. By encoding the first n-1 words of the speech recognition decoder output, the contextual information of the sentence is integrated to improve the accuracy of the speaker vector representation.
By integrating contextual information, the accuracy of speaker recognition results for each word is improved, ensuring that the speech recognition model can more accurately determine the speaker corresponding to each word.
[Figure: CN116825095B_ABST]
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a speech recognition method, device, and storage medium.
Background Technology
[0002] Many speech recognition scenarios involve multiple speakers. For example, a meeting may have two or more speakers, and their speech must be recognized together to determine what each speaker said. While current speech recognition technology can accurately recognize the speech of a single speaker, the recognition rate drops significantly when there are two or more speakers.
[0003] End-to-end speaker-related speech recognition models can be used to recognize the speech of multiple speakers. However, existing end-to-end speaker-related speech recognition models cannot achieve highly accurate speech recognition results.
Summary of the Invention
[0004] This invention provides a speech recognition method, device, and storage medium to improve the accuracy of speech recognition results from multiple speakers.
[0005] In a first aspect, embodiments of the present invention provide a speech recognition method based on a speech recognition model, wherein the speech recognition model includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder including an attention layer, the method comprising:
[0006] Acquire speech signals from multiple speakers and profile feature vectors of the multiple speakers, wherein the speech signals contain the speech of the multiple speakers;
[0007] The speech recognition encoder obtains a first vector representation corresponding to the speech signal, and the speaker encoder obtains a second vector representation corresponding to the speech signal. The first vector representation is used for speech recognition, and the second vector representation is used for speaker recognition.
[0008] The text encoder encodes the first n-1 characters output by the speech recognition decoder to obtain the third vector representation corresponding to the (n-1)th character;
[0009] The first vector representation, the second vector representation, and the third vector representation are input into the speaker decoder to obtain the speaker vector representation corresponding to the nth character;
[0010] The speaker corresponding to the nth character is determined based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the profile feature vectors of the multiple speakers;
[0011] The first vector representation, the first n-1 characters, and the sum of the profile feature vectors of the multiple speakers weighted by the correlation coefficients are input into the speech recognition decoder to obtain the nth character.
[0012] In a second aspect, embodiments of the present invention provide a speech recognition device based on a speech recognition model, wherein the speech recognition model includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder including an attention layer, the device comprising:
[0013] The acquisition module is used to acquire the speech signals of multiple speakers and the profile feature vectors of the multiple speakers, wherein the speech signals contain the speech of the multiple speakers;
[0014] The first encoding module is used to obtain a first vector representation corresponding to the speech signal through the speech recognition encoder, and to obtain a second vector representation corresponding to the speech signal through the speaker encoder. The first vector representation is used for speech recognition, and the second vector representation is used for speaker recognition.
[0015] The second encoding module is used to encode the first n-1 characters output by the speech recognition decoder through the text encoder to obtain the third vector representation corresponding to the (n-1)th character;
[0016] The first decoding module is used to input the first vector representation, the second vector representation, and the third vector representation into the speaker decoder to obtain the speaker vector representation corresponding to the nth character; and to determine the speaker corresponding to the nth character based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the profile feature vectors of the multiple speakers.
[0017] The second decoding module is used to input the first vector representation, the first n-1 characters, and the sum of the profile feature vectors of the multiple speakers weighted by the correlation coefficients into the speech recognition decoder to obtain the nth character.
[0018] In a third aspect, embodiments of the present invention provide an electronic device, including: a memory, a processor, and a communication interface; wherein the memory stores executable code, and when the executable code is executed by the processor, the processor performs the speech recognition method as described in the first aspect.
[0019] In a fourth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium storing executable code, which, when executed by a processor of an electronic device, enables the processor to at least implement the speech recognition method as described in the first aspect.
[0020] In a fifth aspect, embodiments of the present invention provide a speech recognition method, the method comprising:
[0021] Receiving a request triggered by a terminal device calling a speech recognition service, the request including speech signals of multiple speakers and profile feature vectors of the multiple speakers, the speech signals containing the speech of the multiple speakers;
[0022] Based on the computing resources corresponding to the model training service, the following steps are performed:
[0023] Obtain a speech recognition model, which includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder containing an attention layer;
[0024] The speech recognition encoder obtains a first vector representation corresponding to the speech signal, and the speaker encoder obtains a second vector representation corresponding to the speech signal. The first vector representation is used for speech recognition, and the second vector representation is used for speaker recognition.
[0025] The text encoder encodes the first n-1 characters output by the speech recognition decoder to obtain the third vector representation corresponding to the (n-1)th character;
[0026] The first vector representation, the second vector representation, and the third vector representation are input into the speaker decoder to obtain the speaker vector representation corresponding to the nth character;
[0027] The speaker corresponding to the nth character is determined based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the profile feature vectors of the multiple speakers;
[0028] The first vector representation, the first n-1 characters, and the sum of the profile feature vectors of the multiple speakers weighted by the correlation coefficients are input into the speech recognition decoder to obtain the nth character;
[0029] The speech recognition output information is sent to the terminal device, and the speech recognition output information includes the text sequence corresponding to each of the multiple speakers.
[0030] In the speech recognition scheme provided in the above embodiments of the present invention, the speech recognition model includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder containing an attention layer. When performing speech recognition on speech signals from multiple speakers, the speech recognition encoder in the speech recognition model first encodes the speech signal to obtain a first vector representation for speech recognition, and the speaker encoder encodes the speech signal to obtain a second vector representation for speaker recognition. Then, the text encoder encodes the first n-1 characters output by the speech recognition decoder to obtain a third vector representation corresponding to the (n-1)th character. Next, the first, second, and third vector representations are input into the speaker decoder to obtain the speaker vector representation corresponding to the nth character. Based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the profile feature vectors of the multiple speakers, the speaker corresponding to the nth character is determined. Finally, the first vector representation, the first n-1 characters, and the sum of the profile feature vectors of the multiple speakers weighted by the correlation coefficients are input into the speech recognition decoder to obtain the nth character.
[0031] In the above scheme, a text encoder with an attention layer is added to the speech recognition model. By encoding the first n-1 characters output by the speech recognition decoder through the text encoder, the context information of the whole sentence can be better aggregated to obtain the third vector representation, containing context information, corresponding to the (n-1)th character. In other words, a more accurate vector representation for speaker recognition can be obtained, thereby improving the accuracy of the speaker recognition results corresponding to each character.
Description of the Drawings
[0032] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings described below show some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0033] Figure 1 is a flowchart of a speech recognition method provided in an embodiment of the present invention;
[0034] Figure 2 is a schematic diagram of the structure of a speech recognition model provided in an embodiment of the present invention;
[0035] Figure 3 is a schematic diagram of the structure of the speech recognition encoder and speech recognition decoder provided in an embodiment of the present invention;
[0036] Figure 4 is a flowchart illustrating how to determine the speaker vector representation corresponding to the nth word using a speaker decoder;
[0037] Figure 5 is a flowchart of another speech recognition method provided in an embodiment of the present invention;
[0038] Figure 6 is a flowchart illustrating how to obtain the nth word using a speech recognition decoder;
[0039] Figure 7 is a schematic diagram illustrating the application of a speech recognition method provided in an embodiment of the present invention;
[0040] Figure 8 is a flowchart of another speech recognition method provided in an embodiment of the present invention;
[0041] Figure 9 is a schematic diagram illustrating a speech recognition process in a cloud service mode, as provided in an embodiment of the present invention;
[0042] Figure 10 is a schematic diagram of the structure of a speech recognition device provided in an embodiment of the present invention;
[0043] Figure 11 is a schematic diagram of the structure of an electronic device provided in this embodiment.
Detailed Implementation
[0044] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention. In addition, the timing of the steps in the following method embodiments is only an example and not a strict limitation.
[0045] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in the embodiments of the present invention are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0046] The following is a brief introduction to some concepts involved in the embodiments of the present invention.
[0047] Speaker-Attributed Automatic Speech Recognition (SA-ASR) is a speech recognition task designed to solve the problem of "who said what".
[0048] Serialized Output Training (SOT): In speech transcription (i.e., speech-to-text), special delimiters are used to connect the text spoken by different speakers. The transcribed text is arranged in the chronological order in which the speakers began speaking, generating a single text sequence, which is then output. For example, in a speech segment with two speakers, where speaker A says "How's the weather today?" and speaker B says "The weather is nice today.", serialized output training on this segment produces the output "How's the weather today? <sc> The weather is nice today.", and each word or sentence is associated with a corresponding speaker tag.
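The serialization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Utterance` record, its `start` field, and the `serialize` helper are hypothetical names introduced here, and the spacing around the `<sc>` delimiter is a formatting choice rather than something the description fixes.

```python
from dataclasses import dataclass

SC = "<sc>"  # speaker-change delimiter used in the example above


@dataclass
class Utterance:
    speaker: str   # speaker tag (hypothetical field)
    start: float   # time at which the speaker began speaking, in seconds (hypothetical field)
    text: str      # transcribed text of the utterance


def serialize(utterances):
    """Order utterances by start time and join their texts with the delimiter."""
    ordered = sorted(utterances, key=lambda u: u.start)
    text = f" {SC} ".join(u.text for u in ordered)
    speakers = [u.speaker for u in ordered]
    return text, speakers


utts = [
    Utterance("B", 1.2, "The weather is nice today."),
    Utterance("A", 0.0, "How's the weather today?"),
]
serialized, speakers = serialize(utts)
```

Here `serialized` holds the single delimiter-joined text sequence and `speakers` holds the speaker tags in the same order, one per segment.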
[0049] Speaker Profile: This refers to a set of feature vectors that contain profile feature vectors of multiple speakers. It is possible to extract the profile feature vectors of each speaker from a segment of their speech signal.
[0050] End-to-End (E2E): Unlike modular processing of the target task, E2E involves a single, complete model processing the target task. The input is the raw data, and the output is the final processed result. During training, the overall objective function can be directly optimized.
[0051] Existing end-to-end speaker-related speech recognition models mainly include a speech recognition encoder, a speaker encoder, a speech recognition decoder, and a speaker decoder. These models lack a context-aware module. When recognizing speech from multiple speakers using these encoders and decoders, the contextual information corresponding to each word is not fully considered, leading to inaccurate predictions of the speaker corresponding to each word and affecting the final speech recognition result. To address the problem of inaccurate multi-speaker speech recognition results, this invention provides a novel speech recognition scheme. In this scheme, a text encoder with an attention layer is added to the speech recognition model. When determining the speaker vector representation corresponding to the nth word, this text encoder encodes the first n-1 words already output by the speech recognition decoder to integrate the sentence's contextual information, making the speaker vector representation corresponding to the nth word more accurate. This allows the speaker decoder to more accurately predict the speaker corresponding to each word.
[0052] The following detailed description of some embodiments of the present invention is provided in conjunction with the accompanying drawings. Where there is no conflict between the embodiments, the following embodiments and features thereof can be combined with each other.
[0053] Figure 1 is a flowchart of a speech recognition method provided in an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:
[0054] 101. Obtain the speech signals and profile feature vectors of multiple speakers. The speech signals contain the speech of the multiple speakers.
[0055] 102. Obtain a first vector representation of the speech signal through a speech recognition encoder, and obtain a second vector representation of the speech signal through a speaker encoder. The first vector representation is used for speech recognition, and the second vector representation is used for speaker recognition.
[0056] 103. Encode the first n-1 characters output by the speech recognition decoder with a text encoder to obtain the third vector representation corresponding to the (n-1)th character.
[0057] 104. Input the first vector representation, the second vector representation, and the third vector representation into the speaker decoder to obtain the speaker vector representation corresponding to the nth character.
[0058] 105. Determine the speaker corresponding to the nth character based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the profile feature vectors of the multiple speakers.
[0059] 106. Input the first vector representation, the first n-1 characters, and the sum of the profile feature vectors of the multiple speakers weighted by the correlation coefficients into the speech recognition decoder to obtain the nth character.
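The data flow of steps 103 to 106 for a single position n can be sketched as follows. This is a minimal sketch, not the patented implementation: the function `decode_step`, its argument names, and the toy stand-in components passed to it are all hypothetical, and real components would be trained neural networks rather than the lambdas used here.

```python
def decode_step(prev_tokens, h_asr, h_spk, profiles,
                text_encoder, speaker_decoder, correlate, asr_decoder):
    """Run steps 103-106 once: predict token n and its speaker from tokens 1..n-1."""
    context = text_encoder(prev_tokens)               # step 103: third vector representation
    q_n = speaker_decoder(h_asr, h_spk, context)      # step 104: speaker vector for token n
    betas = correlate(q_n, profiles)                  # step 105: one coefficient per speaker
    speaker = max(range(len(betas)), key=lambda k: betas[k])
    dim = len(profiles[0])
    # Sum of the profile vectors weighted by the correlation coefficients.
    weighted = [sum(b * d[i] for b, d in zip(betas, profiles)) for i in range(dim)]
    token = asr_decoder(h_asr, prev_tokens, weighted)  # step 106: token n
    return token, speaker


# Toy stand-ins (hypothetical) so the data flow can be exercised end to end.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


profiles = [[0.2, 0.8], [0.9, 0.1]]
token, speaker = decode_step(
    prev_tokens=["how", "is"],
    h_asr=[0.0],   # placeholder for the first vector representation (step 102)
    h_spk=[0.0],   # placeholder for the second vector representation (step 102)
    profiles=profiles,
    text_encoder=lambda toks: [float(len(toks))],
    speaker_decoder=lambda h_asr, h_spk, ctx: [1.0, 0.0],
    correlate=lambda q, ds: [dot(q, d) for d in ds],
    asr_decoder=lambda h_asr, toks, w: f"tok{len(toks) + 1}",
)
```

Passing the components in as callables keeps the sketch independent of any particular network architecture; only the order in which the representations are produced and consumed is taken from the steps above.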
[0060] The speech recognition scheme provided in this embodiment of the invention can use a pre-trained speech recognition model to process the speech signals of multiple speakers to be recognized, so as to obtain each word corresponding to the speech signal and the speaker corresponding to each word. Since the training process of the speech recognition model is similar to the process of recognizing the speech signals of multiple speakers to be recognized using the speech recognition model, only the process of using the speech recognition model is described here.
[0061] The structure of the speech recognition model is shown in Figure 2. The speech recognition model mainly includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder containing an attention layer. The speech recognition encoder is mainly used to convert the speech signals from multiple speakers into a first vector representation for speech recognition. The speaker encoder is mainly used to convert the speech signals from multiple speakers into a second vector representation for speaker recognition. The speech recognition decoder is mainly used to complete speech-to-text recognition, identifying all the text spoken by each speaker in the speech signal, and finally outputting the speech recognition result corresponding to each speaker. The speaker decoder is mainly used to determine the correspondence between each character and the speaker. The text encoder is mainly used to integrate the contextual information of the text, so that the speaker vector representation corresponding to each character is determined based on the contextual information and is therefore more accurate.
[0062] In an optional embodiment, the specific structures of the speech recognition encoder and speech recognition decoder are shown in Figure 3. The speech recognition encoder can include multiple cascaded encoders, each of which can include two sub-layers: an attention layer and a feedforward neural network layer. The speech recognition decoder can include multiple cascaded decoders, each of which includes an attention layer and a feedforward neural network layer. The speaker encoder has a structure similar to that of the speech recognition encoder shown in Figure 3: it may include multiple cascaded encoders, each including an attention layer and a feedforward neural network layer. Likewise, the speaker decoder has a structure similar to that of the speech recognition decoder: it includes multiple cascaded decoders, each including an attention layer and a feedforward neural network layer. The text encoder includes at least one encoder, each including an attention layer and a feedforward neural network layer. The number of encoders in the speech recognition encoder and the speaker encoder can be set according to actual needs and is not limited here, and the same applies to the number of decoders in the speech recognition decoder and the speaker decoder. Furthermore, the attention layer in a decoder can include a self-attention layer and a source-target attention layer.
[0063] From the above description, the speech recognition model mainly includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder containing an attention layer. When the speech recognition model is used to perform speech recognition on speech signals from multiple speakers, the encoders and decoders in the model execute as follows. First, the speech signals of multiple speakers and the profile feature vectors of the multiple speakers are acquired. The speech signal contains the speech of the multiple speakers. A speaker's profile feature vector is a feature vector characterizing that speaker, which can be extracted from a segment of that speaker's speech signal.
[0064] Before processing the speech signals of multiple speakers, the profile features of the multiple speakers can be extracted in advance to obtain the profile feature vector of each speaker. Specifically, the speech signals of each speaker can be encoded using an encoder with the same structure as the speaker encoder to obtain the profile feature vector of each speaker.
[0065] After acquiring a speech signal containing multiple speakers, the speech signal can be processed to extract valid speech information and obtain the corresponding acoustic feature sequence. The specific implementation process may include: segmenting the acquired speech signal containing multiple speakers into frames, extracting acoustic features from each frame, and generating the acoustic feature sequence corresponding to the speech signal from the acoustic features of the frames.
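The framing step above can be sketched as follows. This is an illustration under assumptions that the description does not state: the default frame length of 400 samples and hop of 160 samples (25 ms and 10 ms at 16 kHz) are common conventions, and log energy is used only as a stand-in for whatever acoustic feature an actual system extracts per frame. The helper names are hypothetical.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Return overlapping frames of frame_len samples; a trailing partial frame is dropped."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def log_energy(frame):
    """Stand-in acoustic feature: log of the frame's mean squared amplitude."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return math.log(mean_sq + 1e-10)  # small offset avoids log(0) on silent frames


def acoustic_feature_sequence(samples, frame_len=400, hop=160):
    """One feature per frame, in frame order (assumed defaults, see lead-in)."""
    return [log_energy(f) for f in split_into_frames(samples, frame_len, hop)]


samples = [float(i) for i in range(10)]
frames = split_into_frames(samples, frame_len=4, hop=2)
features = acoustic_feature_sequence(samples, frame_len=4, hop=2)
```

The resulting `features` list plays the role of the acoustic feature sequence that is subsequently fed to both encoders.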
[0066] Next, the acoustic feature sequence corresponding to the speech signal is input into a speech recognition encoder to obtain a first vector representation of the speech signal. This first vector representation is used for speech recognition; that is, it characterizes the speech features of the speech signal. Then, the acoustic feature sequence corresponding to the speech signal is input into a speaker encoder to obtain a second vector representation of the speech signal. This second vector representation is used for speaker recognition; that is, it characterizes features such as the speaker's timbre in the speech signal.
[0067] In this embodiment, the speech recognition encoder can convert the acoustic feature sequence corresponding to the speech signal into a vector representation that can be used for speech recognition, so that subsequent speech recognition processing can be performed directly based on the first vector representation to obtain each word corresponding to the speech signal. The speaker encoder can convert the acoustic feature sequence corresponding to the speech signal into a vector representation for speaker recognition, so that subsequent speaker recognition processing can be performed directly based on the second vector representation for each word.
[0068] When processing speech signals for recognition and transcription, prediction is performed character by character. For example, after the speech recognition decoder determines the first character of the speech signal, this first character is fed back to the input of the speech recognition decoder to determine the second character, and so on, until the end character is encountered. Since the specific implementation process for predicting each character in the speech signal is basically the same, we will use the prediction process of the nth character as an example, assuming that the speech recognition decoder has already output the first n-1 characters. It should be noted that the first recognized character is predicted based on a set start character, or a null character, as the input of the speech recognition decoder.
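The character-by-character feedback loop described above can be sketched as a greedy decoding loop. This is a minimal sketch: the token strings `<sos>` and `<eos>`, the `max_len` safeguard, and the scripted `toy_next_token` predictor are assumptions introduced for illustration and stand in for the real speech recognition decoder.

```python
SOS, EOS = "<sos>", "<eos>"  # assumed start and end markers


def greedy_decode(next_token, max_len=50):
    """Feed each predicted token back as input until the end token appears."""
    tokens = [SOS]  # the first prediction is conditioned on the start token only
    while len(tokens) <= max_len:
        tok = next_token(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens[1:]  # drop the start token


# Toy predictor (hypothetical): replays a fixed script instead of running a model.
SCRIPT = ["how", "is", "the", "weather", EOS]


def toy_next_token(prefix):
    return SCRIPT[len(prefix) - 1]


result = greedy_decode(toy_next_token)
```

The `max_len` bound is a practical guard against a predictor that never emits the end token; it is not part of the description.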
[0069] When determining the nth character, the first step is to obtain the third vector representation, which contains contextual information, corresponding to the (n-1)th character. Combining the information from the first n-1 characters in this way makes the determination of the nth character more accurate. Since the text encoder primarily integrates the contextual information of each character to obtain a vector representation containing this information, after the first n-1 characters output by the speech recognition decoder are obtained, the text encoder encodes these characters to obtain the third vector representation corresponding to the (n-1)th character.
[0070] It is important to note that when determining the third vector representation corresponding to the (n-1)th character, the text encoder first obtains the first n-1 characters output by the speech recognition decoder after its processing steps, and encodes these n-1 characters to obtain the encoding result corresponding to the (n-1)th character as its third vector representation. For example, when predicting the 4th character (n = 4), the text encoder first obtains the first 3 characters output by the speech recognition decoder, encodes these 3 characters, and obtains the third vector representation corresponding to the 3rd character.
[0071] In one optional embodiment, the specific implementation process of encoding the first n-1 characters output by the speech recognition decoder using a text encoder to obtain the third vector representation corresponding to the (n-1)th character may include: obtaining the weighted vector representations of the attention coefficients corresponding to the first n-1 characters after the self-attention layer in the speech recognition decoder performs attention calculations on the first n-1 characters respectively; and inputting the weighted vector representations of the attention coefficients corresponding to the first n-1 characters into the text encoder to obtain the third vector representation corresponding to the (n-1)th character output by the text encoder. In other words, what is input to the text encoder is the weighted encoded vector corresponding to each of the first n-1 characters output by the self-attention layer in the speech recognition decoder.
[0072] As mentioned above, a speech recognition decoder can include at least one decoder, and each decoder can include an attention layer, which can include several different attention layers, such as self-attention layers and source-target attention layers. To reduce computational complexity and preserve shallow semantic information for each character, the self-attention layer in the aforementioned speech recognition decoder can optionally be that of the first layer of the speech recognition decoder, i.e., the self-attention layer included in the first decoder. In this case, the outputs of the self-attention layer in the first layer of the speech recognition decoder for the first n-1 characters are collected, forming a sequence, which is then fed into the text encoder. In fact, the text encoder outputs the third vector representation of each of the first n-1 characters after fusing contextual information; here, the third vector representation corresponding to the (n-1)th character is simply extracted from the third vector representations corresponding to the first n-1 characters. The text encoder can fuse textual contextual information because it includes an attention layer, such as an attention layer employing a self-attention mechanism or other attention mechanisms; this description uses a self-attention layer as an example.
[0073] Optionally, for the speech recognition decoder, the attention calculation performed by the self-attention layer in the first layer on the first n-1 characters can be expressed as the following formula:

$$\bar{z}_{n-1} = \mathrm{MHA}\left(z_{n-1},\ z_{[1:n-1]},\ z_{[1:n-1]}\right)$$

where $\bar{z}_{n-1}$ represents the weighted vector representation of the attention coefficients corresponding to the (n-1)th character output by the first layer of the speech recognition decoder; $z_{n-1}$ represents the embedding vector representation corresponding to the (n-1)th character input to the first layer of the speech recognition decoder; $\mathrm{MHA}$ represents the multi-head attention computation performed by the self-attention layer in the first layer of the speech recognition decoder; and $z_{[1:n-1]}$ represents the embedding vector representations of the first n-1 characters input to the first layer of the speech recognition decoder. The embedding vector representation is the vector representation obtained after embedding encoding, and it can be obtained by performing embedding encoding, or embedding encoding and positional encoding, on the first n-1 characters respectively. Specifically, the formula

$$z_{[1:n-1]} = \mathrm{PosEnc}\left(\mathrm{Embed}\left(y_{[1:n-1]}\right)\right)$$

can be used to obtain the embedding vector representations corresponding to the first n-1 characters. Here, $\mathrm{PosEnc}$ represents positional encoding, $\mathrm{Embed}$ represents embedding encoding, and $y_{[1:n-1]}$ represents the first n-1 characters output by the speech recognition decoder.
[0074] After obtaining the weighted vector representations of the attention coefficients for the first n-1 characters, the text encoder processes these weighted vector representations to obtain the third vector representation for the (n-1)th character. In an optional embodiment, the text encoder can use the formula

$$c_{n-1} = \mathrm{ContextEnc}\left(\bar{z}_{[1:n-1]}\right)$$

to calculate the third vector representation corresponding to the (n-1)th character, where $c_{n-1}$ represents the third vector representation corresponding to the (n-1)th character, $\mathrm{ContextEnc}$ represents the context text encoding, and $\bar{z}_{[1:n-1]}$ represents the weighted vector representations of the attention coefficients corresponding to the first n-1 characters.
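The two attention stages described above can be illustrated with a heavily simplified Python sketch. It makes several assumptions that the description does not: a single attention head, no learned projections, no residual connections or feedforward layers, causal attention in the decoder stage (each position attends only to itself and earlier positions), and a single attention pass standing in for the text encoder. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def self_attend_prefix(embeddings):
    """Decoder-style self-attention: position i attends to positions 0..i."""
    return [attend(embeddings[i], embeddings[:i + 1], embeddings[:i + 1])
            for i in range(len(embeddings))]


# Embeddings of the first n-1 = 3 characters (toy values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zbar = self_attend_prefix(embeddings)   # attention-weighted vectors for characters 1..n-1
context = attend(zbar[-1], zbar, zbar)  # stand-in text encoder output for character n-1
```

In this sketch `zbar` corresponds to the sequence fed into the text encoder and `context` to the third vector representation of the (n-1)th character; a real model would use trained multi-head layers in both places.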
[0075] After obtaining the third vector representation, the first, second, and third vector representations are input into the speaker decoder to obtain the speaker vector representation corresponding to the nth word. The speaker vector representation can be used to characterize the speaker's features. The determined speaker vector representation corresponding to the nth word combines speech recognition features, speaker recognition features, and contextual information features of the speech signal. This results in a higher quality speaker vector representation for the nth word, thereby improving the accuracy of predicting the speaker for each word and making the transcription results corresponding to the speech signal more accurate.
[0076] After the speaker decoder determines the speaker vector representation corresponding to the nth word, the speaker corresponding to the nth word is determined from the correlation coefficients between that speaker vector representation and the profile feature vectors of the multiple speakers. Each correlation coefficient characterizes how strongly the speaker vector representation of the word correlates with one speaker's profile feature vector; in other words, it represents the correlation between the nth word and each speaker contained in the speech signal. Therefore, to determine the speaker corresponding to the nth word, the correlation coefficients between the speaker vector representation of the nth word and the profile feature vectors of the multiple speakers are obtained first, and the speaker is then determined based on these coefficients.
[0077] For example, suppose the first speech signal contains three speakers, the speaker vector representation corresponding to the nth word is $q_n$, and the profile feature vectors of the first, second, and third speakers are $d_1$, $d_2$, and $d_3$. The correlation coefficient $\beta_{n,1}$ between $q_n$ and $d_1$, the correlation coefficient $\beta_{n,2}$ between $q_n$ and $d_2$, and the correlation coefficient $\beta_{n,3}$ between $q_n$ and $d_3$ are obtained. The speaker corresponding to the nth word is then determined based on these three correlation coefficients. In an optional embodiment, the speaker with the largest correlation coefficient is selected as the speaker corresponding to the nth word.
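The selection rule in the optional embodiment above can be sketched in a few lines. This is a minimal illustration rather than the patented implementation; the function name `pick_speaker` and the coefficient values are hypothetical.

```python
def pick_speaker(betas):
    """Return the 0-based index of the largest correlation coefficient."""
    if not betas:
        raise ValueError("need at least one correlation coefficient")
    return max(range(len(betas)), key=lambda k: betas[k])


# Hypothetical values standing in for beta_{n,1}, beta_{n,2}, beta_{n,3}.
betas_n = [0.2, 0.7, 0.1]
speaker_index = pick_speaker(betas_n)  # index of the second speaker
```

Ties are resolved in favor of the lowest index here, which is an arbitrary choice; the source does not say how ties are handled.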
[0078] Then, a weighted sum of the profile feature vectors of the multiple speakers is calculated, using the correlation coefficients as weights; the weighted sum can be viewed as the weighted speaker vector representation corresponding to the nth word. Optionally, the weighted sum can be calculated by the formula $\bar{d}_n = \sum_{k=1}^{K} \beta_{n,k} d_k$, where $\bar{d}_n$ is the weighted sum, K is the total number of speakers in the speech signal, $\beta_{n,k}$ is the correlation coefficient between the speaker vector representation corresponding to the nth word and the profile feature vector of the kth speaker, and $d_k$ is the profile feature vector of the kth speaker.
[0079] Finally, the first vector representation, the first n-1 words, and the weighted sum over the profile feature vectors of the multiple speakers are input into the speech recognition decoder to obtain the nth word. The weighted sum is the weighted speaker vector representation corresponding to the nth word, and it is introduced when the nth word is determined. In this way, the result for the nth word includes not only the content of the word but also the speaker corresponding to it. For example, the content of the output nth word is "I", and the corresponding speaker is "Xiao Li".
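As a small worked illustration of the weighted sum, the sketch below multiplies each speaker's profile feature vector by its correlation coefficient and adds the results. The function name `weighted_profile` and the toy two-dimensional vectors are assumptions made for the example.

```python
def weighted_profile(betas, profiles):
    """Sum the profile vectors d_k, each scaled by its coefficient beta_{n,k}."""
    if len(betas) != len(profiles):
        raise ValueError("one coefficient is needed per speaker profile")
    dim = len(profiles[0])
    result = [0.0] * dim
    for beta, d_k in zip(betas, profiles):
        for i in range(dim):
            result[i] += beta * d_k[i]
    return result


# Two hypothetical speakers with 2-dimensional profile vectors.
d_bar = weighted_profile([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```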
[0080] According to the above method, each word in the speech signal and the speaker corresponding to each word can be determined in sequence. Finally, the recognition result in the SOT format corresponding to each speaker can be determined. For example, in a speech signal, it is recognized that there are two speakers, and the speaker corresponding to the recognized text "How's the weather today" is speaker a, and the speaker corresponding to the recognized text "The weather is nice today" is speaker b. The output content is: How's the weather today <sc>The weather is nice today.
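As a rough illustration of how per-word speaker decisions could be folded into the serialized output shown above, the sketch below groups consecutive words by speaker and joins the groups with `<sc>`. The helper names, the whitespace around `<sc>`, and the assumption that the words arrive already in output order are illustrative choices, not details taken from the source.

```python
SC = "<sc>"  # speaker-change token used in the example output


def group_by_speaker(words_with_speakers):
    """Merge consecutive (word, speaker) pairs that share a speaker."""
    segments = []
    for word, speaker in words_with_speakers:
        if segments and segments[-1][0] == speaker:
            segments[-1][1].append(word)
        else:
            segments.append((speaker, [word]))
    return segments


def to_serialized_text(segments):
    """Join the segments' words, separating segments with the <sc> token."""
    return f" {SC} ".join(" ".join(words) for _, words in segments)


pairs = [("How's", "a"), ("the", "a"), ("weather", "a"), ("today", "a"),
         ("The", "b"), ("weather", "b"), ("is", "b"), ("nice", "b"), ("today", "b")]
text = to_serialized_text(group_by_speaker(pairs))
```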
[0081] In this embodiment of the invention, a text encoder containing an attention layer is added to the speech recognition model. By encoding the first n-1 characters output by the speech recognition decoder through the text encoder, the context information of the entire sentence can be better aggregated to obtain the third vector representation containing context information corresponding to the n-1th character. In other words, based on the first vector representation, the second vector representation, and the third vector representation, a more accurate speaker vector representation for speaker recognition can be obtained, thereby improving the accuracy of the speaker recognition results corresponding to each character.
[0082] In practical applications, the speaker decoder comprises multiple cascaded decoders. To facilitate understanding of how the decoder at each layer of the speaker decoder in the speech recognition model operates, the processing performed by each layer is illustrated below with reference to Figure 4.
[0083] Figure 4 is a flowchart of determining the speaker vector representation corresponding to the nth word using the speaker decoder. As shown in Figure 4, the speaker decoder includes multiple cascaded decoders, each of which includes an attention layer and a feedforward neural network layer. The method includes the following steps:
[0084] 401. Input the first vector representation, the second vector representation, and the third vector representation into the first layer decoder in the speaker decoder to obtain the speaker vector representation after the attention coefficient weighting of the (n-1)th word output by the attention layer in the first layer decoder.
[0085] 402. Input the speaker vector representation weighted by the attention coefficient corresponding to the (n-1)th word into the feedforward neural network layer in the first layer decoder to obtain the speaker vector representation corresponding to the (n-1)th word input to the second layer decoder.
[0086] 403. Determine the speaker vector representation corresponding to the (n-1)th word output by the last layer decoder in the speaker decoder.
[0087] 404. Determine the speaker vector representation corresponding to the nth word based on the speaker vector representation corresponding to the (n-1)th word output by the last decoder layer and the speaker vector representation weighted by the attention coefficients of the (n-1)th word output by the attention layer in the first decoder layer.
[0088] The speaker decoder in this embodiment includes multiple decoders, each located at a different level. Each decoder performs decoding layer by layer, with the output of the previous layer serving as the input to the next. Furthermore, each decoder includes an attention layer and a feedforward neural network layer. Decoding is performed by these two sub-layers: the attention layer and the feedforward neural network layer. First, the vector representation input to the decoder is transmitted to the attention layer for processing. The processing result is then input to the feedforward neural network layer, which processes the output of the attention layer and inputs the processed result to the attention layer of the next layer's decoder for further processing. In other words, each decoder in the speaker decoder performs processing sequentially according to this method.
[0089] Specifically, the processing in each decoder can include: First, the first, second, and third vector representations are input into the speaker decoder to obtain the speaker vector representation weighted by the attention coefficients corresponding to the (n-1)th word through the attention layer in the first decoder. Then, the speaker vector representation weighted by the attention coefficients corresponding to the (n-1)th word is input into the feedforward neural network layer in the first decoder to obtain the speaker vector representation corresponding to the (n-1)th word that is input into the second decoder. In other words, the output of the feedforward neural network layer in the first decoder (the speaker vector representation corresponding to the (n-1)th word) serves as the input to the attention layer in the second decoder.
[0090] In an alternative embodiment, the attention layer in the first-layer decoder can calculate the attention-coefficient-weighted speaker vector representation corresponding to the (n-1)th word by the formula $\bar{h}^{spk,1}_{n-1} = \mathrm{MHA}^{spk,1}(h^{ctx}_{n-1}, H^{asr}, H^{spk})$, where $\bar{h}^{spk,1}_{n-1}$ is the attention-coefficient-weighted speaker vector representation of the (n-1)th word output by the attention layer of the first-layer decoder, $\mathrm{MHA}^{spk,1}$ is the multi-head attention of the attention layer of the first-layer decoder, $h^{ctx}_{n-1}$ is the third vector representation corresponding to the (n-1)th word, $H^{asr}$ is the first vector representation, and $H^{spk}$ is the second vector representation. After this weighted representation is calculated, the feedforward neural network layer in the first-layer decoder can calculate the speaker vector representation corresponding to the (n-1)th word by the formula $h^{spk,1}_{n-1} = \mathrm{FF}^{spk,1}(\bar{h}^{spk,1}_{n-1})$, where $h^{spk,1}_{n-1}$ is the speaker vector representation corresponding to the (n-1)th word that is input to the second-layer decoder, $\bar{h}^{spk,1}_{n-1}$ is the attention-coefficient-weighted speaker vector representation of the (n-1)th word output by the attention layer of the first-layer decoder, and $\mathrm{FF}^{spk,1}$ is the feedforward neural network of the first layer in the speaker decoder.
[0091] In an optional embodiment, the attention layers in the second-layer decoder and in each subsequent decoder include a self-attention layer and a source-target attention layer. After the feedforward neural network layer in the first-layer decoder produces the speaker vector representation corresponding to the (n-1)th word for input to the second-layer decoder, this vector representation is passed to the self-attention layer in the second-layer decoder. That self-attention layer performs self-attention over the speaker vector representation corresponding to the (n-1)th word and the speaker vector representations corresponding to the first n-1 words, yielding the self-attention-weighted vector representation of the (n-1)th word. This representation is input to the source-target attention layer in the second-layer decoder, which performs attention between it and the second vector representation, yielding the source-target-attention-weighted vector representation of the (n-1)th word. That representation is in turn input to the feedforward neural network layer in the second-layer decoder, which processes it to obtain the speaker vector representation corresponding to the (n-1)th word that is input to the third-layer decoder.
[0092] The second and subsequent decoders perform calculations according to the above method until the speaker vector representation corresponding to the (n-1)th word output by the last decoder in the speaker decoder is determined. Based on the speaker vector representation corresponding to the (n-1)th word output by the last decoder and the speaker vector representation weighted by the attention coefficients of the (n-1)th word output by the attention layer in the first decoder, the speaker vector representation corresponding to the nth word is determined. The detailed processing procedure is described above and will not be repeated here.
[0093] In an optional embodiment, the self-attention layer in the second-layer decoder and in each decoder above the second layer can calculate the self-attention-weighted vector representation corresponding to the (n-1)th word, which is input to the source-target attention layer in the same decoder layer, by the formula $\bar{h}^{self,l}_{n-1} = \mathrm{MHA}^{self,l}(h^{spk,l-1}_{n-1}, h^{spk,l-1}_{[1:n-1]}, h^{spk,l-1}_{[1:n-1]})$. Here, l is the layer number of the decoder, with l > 1; $\bar{h}^{self,l}_{n-1}$ is the self-attention-weighted vector representation of the (n-1)th word output by the self-attention layer in the l-th decoder (i.e., the representation input to the source-target attention layer in the l-th decoder); $h^{spk,l-1}_{n-1}$ is the speaker vector representation corresponding to the (n-1)th word output by the feedforward neural network layer in the (l-1)th decoder; $\mathrm{MHA}^{self,l}$ is the multi-head attention of the self-attention layer in the l-th decoder; and $h^{spk,l-1}_{[1:n-1]}$ is the speaker vector representation of the first n-1 words output by the feedforward neural network layer in the (l-1)th decoder.
[0094] In an optional embodiment, the source-target attention layer in the second-layer decoder and in each decoder above the second layer can calculate the attention-coefficient-weighted vector representation corresponding to the (n-1)th word, which is input to the feedforward neural network layer in the same decoder layer, by the formula $\bar{h}^{src,l}_{n-1} = \mathrm{MHA}^{src,l}(\bar{h}^{self,l}_{n-1}, H^{spk}, H^{spk})$. Here, l is the layer number of the decoder, with l > 1; $\bar{h}^{src,l}_{n-1}$ is the attention-coefficient-weighted vector representation of the (n-1)th word output by the source-target attention layer in the l-th decoder (i.e., the representation input to the feedforward neural network layer in the l-th decoder); $\bar{h}^{self,l}_{n-1}$ is the self-attention-weighted vector representation of the (n-1)th word output by the self-attention layer in the l-th decoder; $\mathrm{MHA}^{src,l}$ is the multi-head attention of the source-target attention layer in the l-th decoder; and $H^{spk}$ is the second vector representation.
[0095] In an optional embodiment, the feedforward neural network layer in the second-layer decoder and in each subsequent decoder can calculate the speaker vector representation corresponding to the (n-1)th word, which is input to the next decoder layer, by the formula $h^{spk,l}_{n-1} = \mathrm{FF}^{spk,l}(\bar{h}^{src,l}_{n-1})$. Here, l is the layer number of the decoder, with l > 1; $h^{spk,l}_{n-1}$ is the speaker vector representation corresponding to the (n-1)th word output by the feedforward neural network in the l-th decoder (i.e., the representation input to the (l+1)-th decoder); $\bar{h}^{src,l}_{n-1}$ is the attention-coefficient-weighted vector representation corresponding to the (n-1)th word output by the source-target attention layer in the l-th decoder; and $\mathrm{FF}^{spk,l}$ is the feedforward neural network in the l-th decoder.
[0096] The outputs of the feedforward neural network layers in each decoder layer are determined in sequence until the speaker vector representation corresponding to the (n-1)th word output by the last decoder layer in the speaker decoder is determined. Finally, the speaker decoder applies a skip connection to the speaker vector representation corresponding to the (n-1)th word output by the last decoder layer and the attention-coefficient-weighted speaker vector representation of the (n-1)th word output by the attention layer in the first decoder layer, and predicts the speaker vector representation corresponding to the nth word from the skip-connection result. Specifically, in an optional embodiment, the speaker vector representation $q_n$ corresponding to the nth word is determined by a formula from these two representations, where $W_q$ denotes the weights set in the speaker decoder.
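The layer cascade and the final skip connection can be sketched as follows. This is only an illustration under stated assumptions: the skip connection is taken to be an element-wise sum followed by multiplication with the weight matrix `w_q`, which the text does not confirm, and the attention and feedforward sub-layers are passed in as plain callables (hypothetical stand-ins for trained layers).

```python
def speaker_decoder(query, layers, w_q):
    """Run cascaded (attention, feed_forward) layers, then skip-connect and project.

    layers: list of (attention_fn, feed_forward_fn) pairs, first layer first.
    """
    if not layers:
        raise ValueError("the speaker decoder needs at least one layer")
    first_attention_out = None
    x = query
    for index, (attend, feed_forward) in enumerate(layers):
        attended = attend(x)
        if index == 0:
            first_attention_out = attended  # kept for the skip connection
        x = feed_forward(attended)          # becomes the next layer's input
    # Assumption: the skip connection is an element-wise addition.
    skipped = [a + b for a, b in zip(x, first_attention_out)]
    # Project with W_q (a list of rows) to obtain q_n.
    return [sum(w * s for w, s in zip(row, skipped)) for row in w_q]


# Toy sub-layers: "attention" doubles each value, "feed-forward" adds one.
toy_layer = (lambda v: [2 * t for t in v], lambda v: [t + 1 for t in v])
identity = [[1.0, 0.0], [0.0, 1.0]]
q_n = speaker_decoder([1.0, 2.0], [toy_layer, toy_layer], identity)
```

In a real model the first layer would attend over the first, second, and third vector representations, while later layers would use self-attention and source-target attention; the callables here hide those details on purpose.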
[0097] In this embodiment of the invention, the speaker decoder includes multiple layers of decoders. The first layer of decoders processes a first vector representation, a second vector representation, and a third vector representation containing contextual information to obtain a speaker vector representation weighted by attention coefficients for the (n-1)th character output by the attention layer in the first layer of decoders. That is, the first layer of decoders incorporates contextual information during processing, making the speaker representation weighted by attention coefficients for the (n-1)th character more accurate. Then, the feedforward neural network layer in the first layer of decoders processes the speaker vector weighted by attention coefficients for the (n-1)th character obtained after incorporating contextual information. This results in a higher quality speaker vector representation for the (n-1)th character input to the second layer of decoders, better reflecting the speaker features for the (n-1)th character. Further processing by multiple layers of decoders yields even higher quality speaker vector representations for each character, thereby improving the accuracy of the predicted speaker results for each character.
[0098] The above embodiments describe how the speaker vector representation corresponding to each word is determined. In practical applications, once the speaker vector representation corresponding to each word has been determined, the speaker corresponding to each word can be further determined from it. Specifically, in an optional embodiment, the speaker corresponding to the nth word is determined based on the correlation coefficients between the speaker vector representation corresponding to the nth word and the profile feature vectors of the multiple speakers. Therefore, when determining the speaker corresponding to the nth word, these correlation coefficients are determined first. In the following embodiments, the specific process of determining the correlation coefficients between the speaker vector representation corresponding to the nth word and the profile feature vectors of the multiple speakers is illustrated with reference to Figure 5.
[0099] Figure 5 is a flowchart of another speech recognition method provided in an embodiment of the present invention. As shown in Figure 5, in order to improve the accuracy of the speech recognition results, the method may, based on the above embodiments, further include the following steps:
[0100] 501. For any speaker's profile feature vector among multiple speakers, determine the first similarity between the speaker vector representation corresponding to the nth word and any speaker's profile feature vector according to the set similarity algorithm.
[0101] 502. Obtain the speaker vector representation corresponding to each of the first n words.
[0102] 503. Concatenate the speaker vector representation corresponding to each of the first n words with the portrait feature vector of any speaker.
[0103] 504. Input the multiple concatenated vector representations into the set scoring model containing the attention module to obtain the second similarity between the speaker vector representation corresponding to each of the first n words and the portrait feature vector of any speaker.
[0104] 505. Based on the first similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker, and the second similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker, determine the correlation coefficient between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker.
[0105] In this embodiment of the invention, a joint scoring method using a pre-defined similarity algorithm and a scoring model including an attention module is employed to determine the correlation coefficient between the speaker vector representation corresponding to the nth character and the profile feature vector of any speaker. The correlation coefficient can be used to represent the posterior probability of each speaker corresponding to the nth character.
[0106] Specifically, for the profile feature vector of any speaker among the multiple speakers, a first similarity between the speaker vector representation corresponding to the nth word and that profile feature vector is determined according to the set similarity algorithm. In other words, the set similarity algorithm yields the first similarity between the speaker vector representation corresponding to the nth word and the profile feature vector of each speaker. Optionally, the set similarity algorithm can be a cosine similarity algorithm; for example, the first similarity can be determined by the formula $s^{(1)}_{n,k} = \cos(q_n, d_k)$, where $s^{(1)}_{n,k}$ is the first similarity between the speaker vector representation corresponding to the nth word and the profile feature vector of the kth speaker, cos denotes the cosine function, $q_n$ is the speaker vector representation corresponding to the nth word, and $d_k$ is the profile feature vector of the kth speaker, k being the index of one of the K speakers contained in the speech signal. In other words, a cosine similarity score is computed between the speaker vector representation corresponding to the nth word and the profile feature vector of each speaker, and these scores serve as the basis for the probability that the nth word corresponds to each speaker.
[0107] As can be seen from the above description, the first similarity is determined from the speaker vector representation corresponding to the nth word alone, without considering the speaker vector representations corresponding to the other words among the first n words. The first similarity is therefore determined independently of the other speakers.
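A minimal sketch of the cosine-similarity option follows. The function names and the toy vectors are hypothetical, and the zero-vector check is an added safeguard that the source does not discuss.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_d)


def first_similarities(q_n, profiles):
    """First similarity between q_n and every speaker profile d_k."""
    return [cosine_similarity(q_n, d_k) for d_k in profiles]


# Hypothetical 2-dimensional speaker vector and two speaker profiles.
sims = first_similarities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```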
[0108] To improve the accuracy of the posterior probabilities of each word corresponding to each speaker, the speaker vector representations of other speakers are fully considered when determining the correlation coefficient between the speaker vector representation corresponding to the nth word and the profile feature vector of any speaker. Specifically, a scoring model with an attention module is added to the speech recognition model to obtain the second similarity between the speaker vector representations corresponding to the first n words and the profile feature vector of any speaker.
[0109] The specific process for determining the second similarity can be as follows: Obtain the speaker vector representation corresponding to each of the first n characters, and concatenate each speaker vector representation corresponding to the first n characters with the portrait feature vector of any speaker. Input the multiple concatenated vector representations into a predefined scoring model containing an attention module to obtain the second similarity between each speaker vector representation corresponding to the first n characters and the portrait feature vector of any speaker.
[0110] Specifically, in an optional embodiment, the second similarity between the speaker vector representation corresponding to each of the first n words and the profile feature vector of each speaker can be determined by the formula $s^{(2)}_{n,k} = \tanh(\text{CD-scorer}(q_{[1:n]}, d_k))$, where $s^{(2)}_{n,k}$ is the second similarity between the speaker vector representation corresponding to the nth word and the profile feature vector of the kth speaker, tanh is the hyperbolic tangent function, CD-scorer denotes the scoring model that includes an attention module, $q_{[1:n]}$ denotes the speaker vector representations corresponding to the first n words, and $d_k$ is the profile feature vector of the kth speaker. This formula yields the second similarity between the speaker vector representation corresponding to the nth word and the profile feature vector of each speaker. As the formula indicates, the second similarity is determined as follows: the speaker vector representation corresponding to each of the first n words is obtained and concatenated with the profile feature vector of the kth speaker, giving a first vector sequence of length n. The first vector sequence is then input into the scoring model (e.g., one using a transformer model structure) to obtain a second vector sequence, whose values lie in the range [-1, 1] after the tanh transformation. The last value in this sequence is taken as the second similarity between the speaker vector representation corresponding to the nth word and the profile feature vector of the kth speaker. The tanh function is merely an example of a mapping function and is not a limitation.
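The data flow for the second similarity (concatenate, score, squash with tanh, take the last value) can be sketched as below. The scoring model itself is not specified beyond containing an attention module, so `toy_scorer` is a purely hypothetical stand-in (a running mean of dot products) used only to make the example runnable; it is not a transformer and not the scoring model of the source.

```python
import math


def second_similarity(q_seq, d_k, scorer):
    """Concatenate each q_i with d_k, score the sequence, return tanh of the last score."""
    concatenated = [list(q_i) + list(d_k) for q_i in q_seq]  # length-n sequence
    raw_scores = scorer(concatenated)                         # one score per position
    if len(raw_scores) != len(concatenated):
        raise ValueError("the scorer must return one score per position")
    squashed = [math.tanh(score) for score in raw_scores]     # values in [-1, 1]
    return squashed[-1]                                       # score for the nth word


def toy_scorer(sequence):
    """Hypothetical stand-in: running mean of dot(q_i, d_k) over positions so far."""
    scores, total = [], 0.0
    for position, vector in enumerate(sequence, start=1):
        half = len(vector) // 2
        q_i, d_k = vector[:half], vector[half:]
        total += sum(a * b for a, b in zip(q_i, d_k))
        scores.append(total / position)
    return scores


s2 = second_similarity([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], toy_scorer)
```

Because every position's score can depend on earlier positions, the value for the nth word reflects the preceding words as well, which is the property the paragraph above attributes to the second similarity.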
[0111] After determining the first and second similarities, the correlation coefficient between the speaker vector representation of the nth character and the portrait feature vector of any speaker is determined based on the first similarity between the speaker vector representation of the nth character and the portrait feature vector of any speaker, and the second similarity between the speaker vector representation of the nth character and the portrait feature vector of any speaker. In other words, for the nth character, the first and second similarities are calculated for each speaker, the first and second similarities between the nth character and each speaker are added together to obtain a first sum, and this first sum is divided by the sum of the first and second similarities between the nth character and all speakers to obtain the correlation coefficient between the speaker vector representation of the nth character and the portrait feature vector of each speaker.
[0112] Specifically, in an optional embodiment, the correlation coefficient between the speaker vector representation corresponding to the nth word and the profile feature vector of each speaker can be determined by the formula $\beta_{n,k} = \frac{\exp(s^{(1)}_{n,k} + s^{(2)}_{n,k})}{\sum_{j=1}^{K} \exp(s^{(1)}_{n,j} + s^{(2)}_{n,j})}$, where $\beta_{n,k}$ is the correlation coefficient between the speaker vector representation corresponding to the nth word and the profile feature vector of the kth speaker, exp is the exponential function with base e, $s^{(1)}_{n,k}$ is the first similarity between the speaker vector representation corresponding to the nth word and the profile feature vector of the kth speaker, $s^{(2)}_{n,k}$ is the second similarity between the speaker vector representation corresponding to the nth word and the profile feature vector of the kth speaker, K is the total number of speakers in the speech signal, j indexes the jth speaker, and k can be any one of the K speakers. According to this formula, the correlation coefficient between the speaker vector representation corresponding to the nth word and the profile feature vector of each speaker can be obtained.
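A hedged sketch of this normalization step: the two similarities are added per speaker, exponentiated, and divided by the total over all speakers, so the coefficients sum to one. The function name is hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step added here, not something the source describes; it does not change the result.

```python
import math


def correlation_coefficients(first_sims, second_sims):
    """Normalize exp(first + second) over all speakers for one word."""
    if len(first_sims) != len(second_sims):
        raise ValueError("both similarity lists need one entry per speaker")
    combined = [a + b for a, b in zip(first_sims, second_sims)]
    peak = max(combined)
    exps = [math.exp(value - peak) for value in combined]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarities for two speakers.
betas = correlation_coefficients([0.9, 0.1], [0.5, -0.5])
```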
[0113] After determining the correlation coefficients between the speaker vector representation corresponding to the nth character and the portrait feature vectors of the multiple speakers, the speaker corresponding to the nth character can be determined based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the portrait feature vectors of the multiple speakers.
[0114] In this embodiment of the invention, a first similarity is determined between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker based on a set similarity algorithm. A second similarity is obtained between the speaker vector representations corresponding to the first n characters and the portrait feature vector of any speaker based on a set scoring model including an attention module. Then, based on the first similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker, and the second similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker, a correlation coefficient is determined between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker. That is, by combining the set similarity algorithm with the set scoring model including an attention module, the determined correlation coefficient is more accurate, resulting in better performance of the speech recognition model and further improving the accuracy of the speech recognition results.
[0115] The above embodiment describes a specific implementation method for determining the correlation coefficients between the speaker vector representation corresponding to the nth character and the portrait feature vectors of the multiple speakers. After determining the correlation coefficients between the speaker vector representation corresponding to the nth character and the portrait feature vectors of the multiple speakers, a weighted sum of the correlation coefficients over the portrait feature vectors of the multiple speakers is calculated. Specifically, the product values of the multiple correlation coefficients and the portrait feature vectors of each speaker are obtained, and then the multiple product values are summed to obtain the weighted speaker feature vector corresponding to the nth character.
[0116] After the weighted sum of the correlation coefficients over the profile feature vectors of the multiple speakers is obtained, the first vector representation, the first n-1 words, and the weighted sum are input into the speech recognition decoder to obtain the nth word. In practical applications, the speech recognition decoder includes multiple cascaded decoders. To facilitate understanding of how the decoder at each layer of the speech recognition decoder processes the first vector representation, the first n-1 words, and the weighted sum, the processing performed by each layer is illustrated below with reference to Figure 6.
[0117] Figure 6 is a flowchart of obtaining the nth word using the speech recognition decoder. As shown in Figure 6, the speech recognition decoder includes multiple cascaded decoders, each of which includes an attention layer and a feedforward neural network layer. The method includes the following steps:
[0118] 601. Perform embedding encoding on the first n-1 characters respectively to obtain the embedding vector representation of each of the first n-1 characters.
[0119] 602. Input the embedding vector representations and the first vector representations corresponding to the first n-1 characters into the first layer decoder of the speech recognition decoder to obtain the attention coefficient weighted vector representation of the n-1th character output by the attention layer in the first layer decoder.
[0120] 603. Input the weighted vector representation of the attention coefficient corresponding to the (n-1)th character and the weighted sum into the feedforward neural network layer in the first layer decoder to obtain the vector representation of the (n-1)th character input into the second layer decoder.
[0121] 604. Determine the nth character based on the vector representation of the (n-1)th character output by the last layer of the speech recognition decoder.
[0122] The speech recognition decoder in this embodiment of the invention includes multiple decoder layers. The layers decode in sequence, with the output of each layer serving as the input to the next. Each decoder layer includes an attention layer and a feedforward neural network layer, and each sub-layer processes the vector representation input to that decoder layer. The speech recognition decoder predicts the characters one by one in an autoregressive manner. For example, in the nth iteration, the nth character corresponding to the input speech signal is predicted based on the first n-1 characters, the attention-coefficient-weighted vector representation corresponding to the (n-1)th character, and the weighted sum of the relevance coefficients over the feature vectors of the multiple speaker profiles. The prediction is made from the probabilities that the nth character is each character in dictionary V. Assuming the dictionary contains 4950 characters, the probability o_n that the nth character corresponds to each character in dictionary V is determined; that is, o_n consists of 4950 probabilities. The character with the highest of these 4950 probabilities is selected as the prediction for the nth character.
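The autoregressive prediction described above can be sketched as a greedy loop. This is a minimal illustration, not the embodiment's implementation: `next_char_probs` is a hypothetical stand-in for the whole decoder stack, the end-of-sequence token is an assumption that the description does not mention, and the toy dictionary has 4 entries rather than 4950.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_char_probs: Callable[[List[str]], Sequence[float]],
    vocab: Sequence[str],
    eos: str,
    max_len: int = 50,
) -> List[str]:
    """Predict characters one at a time, each conditioned on the prefix."""
    prefix: List[str] = []
    for _ in range(max_len):
        # o_n: one probability per dictionary entry for the next character.
        probs = next_char_probs(prefix)
        best = max(range(len(vocab)), key=lambda i: probs[i])
        if vocab[best] == eos:  # assumed stop condition
            break
        prefix.append(vocab[best])
    return prefix


# Toy stand-in for the decoder (assumption): emits "a", "b", then stops.
VOCAB = ["a", "b", "c", "<eos>"]


def toy_probs(prefix: List[str]) -> List[float]:
    script = ["a", "b", "<eos>"]
    target = script[min(len(prefix), len(script) - 1)]
    return [0.7 if ch == target else 0.1 for ch in VOCAB]


decoded = greedy_decode(toy_probs, VOCAB, eos="<eos>")
```

Selecting the single most probable character at each step matches the selection rule stated above; a beam search would be a different technique and is not implied by the description.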
[0123] Specifically, embedding encoding is performed on the first n-1 characters to obtain the embedding vector representation of each of them. Optionally, embedding processing is first performed on the first n-1 characters, and positional encoding is then applied to the resulting vectors to obtain the embedding vector representation of each of the first n-1 characters. For example, the embedding vector representations corresponding to the first n-1 characters can be obtained as PosEnc(Embed(y_[1:n-1])), where PosEnc denotes the positional encoding, Embed denotes the embedding function, and y_[1:n-1] denotes the first n-1 characters that the speech recognition decoder has already output.
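As an illustration of the embedding step, the sketch below looks up an embedding for each character and then adds a position signal. The sinusoidal form of the positional encoding, the additive combination, and the toy embedding table `TABLE` are all assumptions made for the example; the description only states that embedding is followed by positional encoding.

```python
import math
from typing import Dict, List


def pos_enc(position: int, dim: int) -> List[float]:
    """Sinusoidal positional encoding (an assumed choice of PosEnc)."""
    out = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return out


def embed_prefix(prefix: List[str], table: Dict[str, List[float]]) -> List[List[float]]:
    """Embed each character, then add the encoding of its position."""
    dim = len(next(iter(table.values())))
    return [
        [e + p for e, p in zip(table[ch], pos_enc(pos, dim))]
        for pos, ch in enumerate(prefix)
    ]


# Hypothetical 4-dimensional embedding table for a two-character dictionary.
TABLE = {"a": [0.1, 0.2, 0.3, 0.4], "b": [0.5, 0.6, 0.7, 0.8]}
vectors = embed_prefix(["a", "b", "a"], TABLE)
```

The same character at two different positions ("a" at positions 0 and 2) receives two different vectors, which is what lets the later attention layers distinguish character order.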
[0124] After obtaining the embedding vector representations corresponding to the first n-1 characters, the embedding vector representations corresponding to the first n-1 characters and the first vector representation are input into the attention layer in the first layer decoder. The attention layer in the first layer decoder processes the embedding vector representations corresponding to the first n-1 characters and the first vector representation to obtain the vector representation of the attention coefficients of the (n-1)th character corresponding to the attention layer in the first layer decoder, and inputs it into the feedforward neural network layer in the first layer decoder.
[0125] In one optional embodiment, the attention layer in each decoder layer of the speech recognition decoder may include a self-attention layer and a source-target attention layer. After the speech recognition decoder receives the embedding vector representations corresponding to the first n-1 characters and the first vector representation, it inputs the embedding vector representations into the self-attention layer of the first layer decoder to obtain the self-attention-weighted vector representation of the (n-1)th character. This vector representation is then input into the source-target attention layer of the first layer decoder, which performs attention over it and the first vector representation to obtain the attention-coefficient-weighted vector representation of the (n-1)th character. Next, the attention-coefficient-weighted vector representation of the (n-1)th character and the weighted sum of the relevance coefficients over the feature vectors of the multiple speaker profiles are input into the feedforward neural network layer of the first layer decoder to obtain the vector representation of the (n-1)th character that is input to the second layer decoder. The second and subsequent decoder layers are processed according to the above method until the vector representation of the (n-1)th character output by the last decoder layer of the speech recognition decoder is obtained.
[0126] Specifically, in an optional embodiment, the self-attention layer in each decoder layer computes, by means of a multi-head attention formula, the self-attention-weighted vector representation of the (n-1)th character that is input to the source-target attention layer of the same decoder layer. In that formula, l denotes the index of the decoder layer. The result is the self-attention-weighted vector representation of the (n-1)th character output by the self-attention layer of the l-th decoder layer (that is, the input to the source-target attention layer of the l-th decoder layer). The formula applies the multi-head attention of the self-attention layer of the l-th decoder layer to the vector representation of the (n-1)th character output by the feedforward neural network layer of the (l-1)th decoder layer and to the vector representations of the first n-1 characters output by the feedforward neural network layer of the (l-1)th decoder layer (that is, the vector representations of the first n-1 characters input to the l-th decoder layer).
[0127] In an optional embodiment, the source-target attention layer in each decoder layer computes, by means of a multi-head attention formula, the attention-coefficient-weighted vector representation of the (n-1)th character that is input to the feedforward neural network layer of the same decoder layer. In that formula, l denotes the index of the decoder layer. The result is the attention-coefficient-weighted vector representation of the (n-1)th character output by the source-target attention layer of the l-th decoder layer (that is, the input to the feedforward neural network layer of the l-th decoder layer). The formula applies the multi-head attention of the source-target attention layer of the l-th decoder layer to the self-attention-weighted vector representation of the (n-1)th character output by the self-attention layer of the l-th decoder layer and to H^asr, the first vector representation.
[0128] In an alternative embodiment, the feedforward neural network layer in the first layer decoder computes, by means of a formula, the vector representation of the (n-1)th character that is input to the second layer decoder. In that formula, l denotes the index of the decoder layer, with l = 1. The result is the vector representation of the (n-1)th character output by the feedforward neural network of the l-th decoder layer (that is, the input to the (l+1)th decoder layer). The formula applies the feedforward neural network of the l-th decoder layer to the attention-coefficient-weighted vector representation of the (n-1)th character output by the source-target attention layer of the l-th decoder layer, together with the weighted sum of the correlation coefficients over the feature vectors of the multiple speaker profiles, where W^spk denotes the weights corresponding to the speaker encoder. For the feedforward neural network layers in the second and higher decoder layers, a corresponding formula with l > 1 computes the vector representation of the (n-1)th character that is input to the next decoder layer (that is, the output of the feedforward neural network layer is used as the input to the next decoder layer).
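The difference between the first decoder layer (l = 1) and the higher layers (l > 1) can be sketched as follows. This is a sketch under stated assumptions, not the formula of the embodiment: the residual connection around the feedforward network, the way the speaker weights `w_spk` project the weighted sum before it is added, and the toy network `toy_ffn` are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def ffn_sublayer(
    attended: Vector,
    feed_forward: Callable[[Vector], Vector],
    layer: int,
    w_spk: Optional[List[Vector]] = None,
    speaker_sum: Optional[Vector] = None,
) -> Vector:
    """Feedforward sub-layer; only layer 1 mixes in the speaker weighted sum."""
    ffn_input = list(attended)
    if layer == 1:
        if w_spk is None or speaker_sum is None:
            raise ValueError("layer 1 needs w_spk and speaker_sum")
        projected = matvec(w_spk, speaker_sum)
        ffn_input = [a + p for a, p in zip(attended, projected)]
    # Residual connection around the feedforward network (assumption).
    return [a + f for a, f in zip(attended, feed_forward(ffn_input))]


def toy_ffn(vec: Vector) -> Vector:
    """Stand-in network: scale by 2, then ReLU."""
    return [max(0.0, 2.0 * x) for x in vec]


attended = [1.0, -1.0]
w_spk = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # projects a 3-dim sum to 2 dims
speaker_sum = [0.5, 2.0, 9.0]

layer1_out = ffn_sublayer(attended, toy_ffn, 1, w_spk, speaker_sum)
layer2_out = ffn_sublayer(attended, toy_ffn, 2)
```

With the same attended input, the two layers produce different outputs solely because layer 1 receives the projected speaker term.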
[0129] Finally, the nth character is determined based on the vector representation of the (n-1)th character output by the last layer of the speech recognition decoder. Specifically, a softmax operation is applied to this vector representation to obtain the posterior probability o_n of the nth character, and the nth character is then predicted based on o_n. In the formula used to determine o_n, softmax is the activation function, and W_o and b_o are the weights corresponding to the speech recognition decoder.
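The output step can be illustrated with a small numeric sketch. It assumes the usual affine-then-softmax output layer, in which the weights W_o and b_o are applied to the last-layer vector before softmax; the exact formula is not reproduced in the text, so this form, the toy weights, and the three-entry dictionary are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_posterior(
    hidden: Sequence[float],
    w_o: Sequence[Sequence[float]],
    b_o: Sequence[float],
) -> List[float]:
    """Posterior over the dictionary: softmax(W_o * hidden + b_o) (assumed form)."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_o, b_o)
    ]
    return softmax(logits)


VOCAB = ["x", "y", "z"]          # toy dictionary
hidden = [1.0, 2.0]              # last-layer vector of the (n-1)th character
w_o = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_o = [0.0, 0.0, 0.5]

o_n = output_posterior(hidden, w_o, b_o)
predicted = VOCAB[max(range(len(VOCAB)), key=lambda i: o_n[i])]
```

The entries of `o_n` sum to 1, and the predicted character is the dictionary entry with the largest posterior.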
[0130] As described above, for the first layer decoder in the speech recognition decoder, embedding encoding is performed on the first n-1 characters already output. Once the embedding vector representations of these n-1 characters are obtained, only the embedding vector representation of the (n-1)th character is used as the query in the self-attention layer, while the embedding vector representations of all n-1 characters are used as the key and value. Self-attention then yields the result of weighting the (n-1)th character by its attention over the first n-1 characters, that is, the self-attention-weighted encoded vector of the (n-1)th character. This vector is added to the embedding vector representation of the (n-1)th character before weighting, and the sum is the self-attention-weighted vector representation of the (n-1)th character, which serves as one of the inputs to the source-target attention layer. The other two inputs are both the first vector representation. The source-target attention layer in the first layer thus outputs the attention-coefficient-weighted vector representation of the (n-1)th character. This vector representation is then input into the feedforward neural network layer of the first layer decoder, whose output is the input of the second layer decoder.
[0131] It is important to note that an additional term, the weighted sum of the relevance coefficients over the feature vectors of the multiple speaker profiles, is added when computing the output of the feedforward neural network layer in the first layer decoder. This term is not added in the feedforward neural network layers of the other decoder layers. Introducing it in the first layer brings a speaker-weighted vector representation for the nth character into speech recognition, so that the recognized nth character carries not only its content but also the identity of its speaker. Ultimately, the speech recognition decoder can directly output the recognition result in SOT format, for example: one speaker's text sequence <sc> another speaker's text sequence.
[0132] In summary, in this embodiment of the invention, the speech recognition decoder predicts the characters one by one in an autoregressive manner. When predicting each character, the layers of the speech recognition decoder decode, layer by layer, the first n-1 characters, the attention-coefficient-weighted vector representation corresponding to the (n-1)th character, and the weighted sum of the relevance coefficients over the feature vectors of the multiple speaker profiles, with the output of each decoder layer used as the input of the next. Finally, the nth character is determined based on the vector representation of the (n-1)th character output by the last decoder layer of the speech recognition decoder. The nth character obtained in this way includes both the content of the character and the speaker corresponding to it, so that text in SOT format can be generated directly from the predicted characters. In addition, context information is taken into account when determining each character, which makes the predicted characters more accurate.
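The query, key, and value roles described above can be sketched with a single-head scaled dot-product attention. This is a simplification made for illustration: the learned query/key/value projections and the multiple heads of a real multi-head attention layer are omitted, and the scaling by the square root of the dimension is an assumption.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def last_position_self_attention(embeddings: List[Vector]) -> Vector:
    """Attend from the (n-1)th embedding over all n-1 embeddings, then add
    the (n-1)th embedding back (the addition described in the text)."""
    query = embeddings[-1]                      # only the last position queries
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in embeddings]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]      # attention coefficients
    context = [
        sum(w * value[i] for w, value in zip(weights, embeddings))
        for i in range(len(query))
    ]
    return [q + c for q, c in zip(query, context)]


out = last_position_self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because only one query is used, the cost per decoding step grows linearly with the number of characters already output rather than quadratically.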
[0133] In recent years, more and more research has focused on speech recognition in more realistic scenarios, such as automatic recording of multi-party meetings, multi-party human-computer interaction, and automatic audio/video annotation. A specific application is shown in Figure 7. As shown in Figure 7, the speech recognition model includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, a text encoder with an attention layer, and a joint scorer. The joint scorer includes a speaker-independent scorer and a speaker-related scorer. The speaker-independent scorer determines the first similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker, based on a predefined similarity algorithm. The speaker-related scorer includes a predefined scoring model with an attention module, which obtains the second similarity between the speaker vector representations corresponding to the first n characters and the portrait feature vector of any speaker.
[0134] Specifically, during speech recognition, each character corresponding to the speech signal is predicted iteratively. The specific recognition and prediction process for each character is basically the same; here, we will use the recognition and prediction process for the nth character as an example. First, we acquire the speech signals and feature vectors of multiple speakers, where the speech signal contains the speech of multiple speakers. Next, we input the speech signals of multiple speakers into the speech recognition encoder and the speaker encoder. The speech recognition encoder obtains the first vector representation corresponding to the speech signal, and the speaker encoder obtains the second vector representation corresponding to the speech signal. The first vector representation is used for speech recognition, and the second vector representation is used for speaker recognition. We acquire the first n-1 characters output by the speech recognition decoder and encode them using the text encoder to obtain the third vector representation corresponding to the (n-1)th character. We then input the first, second, and third vector representations into the speaker decoder to obtain the speaker vector representation corresponding to the nth character.
[0135] Then, the speaker vector representation corresponding to the nth character is concatenated with the portrait feature vector of any speaker. The concatenated vector representation is input into a speaker-independent scorer. For the portrait feature vector of any speaker among multiple speakers, the first similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker is determined according to a set similarity algorithm. Next, the speaker vector representations corresponding to the first n characters are concatenated with the portrait feature vector of any speaker. The multiple concatenated vector representations are input into a scoring model containing an attention module in a speaker-related scorer to obtain the second similarity between the speaker vector representations corresponding to the first n characters and the portrait feature vector of any speaker. Based on the first similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker, and the second similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker, the correlation coefficient between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker is determined. Then, the weighted sum of this correlation coefficient over the portrait feature vectors of multiple speakers is calculated.
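The scoring flow above can be sketched numerically. The sketch makes several assumptions that the description leaves open: cosine similarity stands in for the set similarity algorithm, the second similarities are passed in as plain numbers instead of being produced by the attention-based scoring model, and the two similarities are combined by summation followed by softmax.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def relevance_and_weighted_sum(
    speaker_vec: Vector,
    profiles: List[Vector],
    second_sims: List[float],
) -> Tuple[List[float], Vector]:
    """Return per-speaker relevance coefficients and their weighted sum
    over the profile vectors."""
    first_sims = [cosine(speaker_vec, p) for p in profiles]
    combined = [f + s for f, s in zip(first_sims, second_sims)]  # assumed rule
    peak = max(combined)
    exps = [math.exp(c - peak) for c in combined]
    total = sum(exps)
    coeffs = [e / total for e in exps]
    dim = len(profiles[0])
    weighted = [sum(c * p[i] for c, p in zip(coeffs, profiles)) for i in range(dim)]
    return coeffs, weighted


profiles = [[1.0, 0.0], [0.0, 1.0]]          # two toy speaker profiles
coeffs, weighted = relevance_and_weighted_sum([1.0, 0.0], profiles, [0.2, 0.1])
speaker_index = max(range(len(coeffs)), key=lambda i: coeffs[i])
```

The speaker with the largest coefficient is taken as the speaker of the nth character, and `weighted` is the vector that would be passed on to the speech recognition decoder.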
[0136] Next, the first vector representation, the first n-1 characters, and the weighted sum of the relevance coefficients over the feature vectors of the multiple speaker profiles are input into the speech recognition decoder to obtain the posterior probability corresponding to the nth character. The nth character is then predicted based on this posterior probability. Finally, a text sequence in SOT format is generated based on the characters output by the speech recognition decoder.
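A text sequence in SOT format places the text of different speakers in one stream, separated by the speaker-change token `<sc>`. The helper below, a hypothetical `split_sot` written only for illustration, splits such a token stream back into one text segment per speaker turn.

```python
from typing import List

SC = "<sc>"  # speaker-change token used in the SOT format


def split_sot(tokens: List[str]) -> List[str]:
    """Split a serialized token stream into per-turn text segments."""
    segments: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok == SC:
            segments.append("".join(current))
            current = []
        else:
            current.append(tok)
    segments.append("".join(current))
    return segments


utterances = split_sot(["h", "i", SC, "y", "o"])
```

Here `utterances` holds two segments, one for each side of the `<sc>` token.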
[0137] The specific implementation process involved in the embodiments of the present invention can be referred to the content of the above embodiments, and will not be repeated here.
[0138] To enable the text encoder, speaker-independent scorer, and speaker-related scorer in the above embodiments to better capture the global context information of the speech signal, a two-stage decoding process can be performed. In the first stage, each character corresponding to the speech signal is predicted one by one according to the methods in the above embodiments. After all N characters corresponding to the speech signal have been obtained, the second stage is performed. The second stage uses the prediction results of all N characters from the first stage as input to the text encoder. It then calculates the correlation coefficients between the speaker vector representation corresponding to each character and the feature vectors of the multiple speaker profiles, thereby obtaining the probability distribution over speakers for each of the N characters. Based on these probability distributions, the speaker corresponding to each character is determined again, making the final speaker assignment for the N characters more accurate. The specific implementation process is shown in Figure 8.
[0139] Figure 8 is a flowchart of another speech recognition method provided in an embodiment of the present invention. As shown in Figure 8, in order to improve the accuracy of the speech recognition results and to better integrate global contextual information, the method may, on the basis of the above embodiments, further include the following steps:
[0140] 801. Determine the complete text sequence output by the speech recognition decoder.
[0141] 802. The text sequence is encoded using a text encoder to obtain the fourth vector representation of each character in the text sequence.
[0142] 803. Input the first vector representation, the second vector representation, and the fourth vector representation into the speaker decoder to obtain the speaker vector representation corresponding to each word.
[0143] 804. Based on the correlation coefficients between the speaker vector representation corresponding to each character and the portrait feature vectors of multiple speakers, redetermine the speaker corresponding to each character.
[0144] In this embodiment, during the second-stage decoding, the complete text sequence output by the first-stage speech recognition decoder is first determined. Then, the complete text sequence is input to a text encoder to encode the text sequence, obtaining the fourth vector representation corresponding to each character in the text sequence. Next, the first, second, and fourth vector representations are input to the speaker decoder to obtain the speaker vector representation corresponding to each character. Based on the correlation coefficients between the speaker vector representation corresponding to each character and the profile feature vectors of multiple speakers, the speaker corresponding to each character is re-determined.
[0145] As described above, the first stage of decoding predicts the nth character based on the first n-1 characters, only obtaining the preceding context information, not the following context information. The second stage of decoding, performed after all N characters have been predicted, involves predicting again based on all N characters. This allows for the acquisition of both preceding and following context information for each character, resulting in more accurate predictions of the speaker for each character obtained through this second round of calculation.
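The difference in available context between the two stages can be made concrete with a small sketch. The function `visible_positions` is hypothetical and only lists which characters' text is available to the text encoder when character n is being handled; it does not model the encoder itself.

```python
from typing import List


def visible_positions(n: int, total: int, stage: int) -> List[int]:
    """1-based indices of the characters available as context for character n."""
    if stage == 1:
        return list(range(1, n))          # only the first n-1 characters
    if stage == 2:
        return list(range(1, total + 1))  # all N characters, before and after
    raise ValueError("stage must be 1 or 2")


first_pass = visible_positions(3, 5, stage=1)
second_pass = visible_positions(3, 5, stage=2)
```

For the third of five characters, the first stage sees only characters 1 and 2, while the second stage sees all five, including the following context.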
[0146] The specific implementation process involved in the embodiments of the present invention can be referred to the contents of the above embodiments, and will not be repeated here.
[0147] The speech recognition method provided in this invention can be executed in the cloud, where multiple computing nodes (cloud servers) can be deployed. Each computing node has processing resources such as computing and storage. In the cloud, multiple computing nodes can be organized to provide a certain service; of course, a single computing node can also provide one or more services. The cloud provides this service by providing an external service interface, which users call to use the corresponding service.
[0148] According to the solution provided in this embodiment of the invention, the cloud can provide a service interface with speech recognition services. Users can call this service interface through their terminal devices to trigger a speech recognition service request to the cloud. The request includes speech signals from multiple speakers and feature vectors of multiple speaker profiles. The speech signals contain the speech of multiple speakers. The cloud determines the computing node that responds to the request and uses the processing resources in the computing node to perform the following steps:
[0149] Obtain a speech recognition model, which includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder containing an attention layer;
[0150] A first vector representation corresponding to the speech signal is obtained through a speech recognition encoder, and a second vector representation corresponding to the speech signal is obtained through a speaker encoder. The first vector representation is used for speech recognition, and the second vector representation is used for speaker recognition.
[0151] The first n-1 characters output by the speech recognition decoder are encoded by a text encoder to obtain the third vector representation corresponding to the n-1th character;
[0152] The first vector representation, the second vector representation, and the third vector representation are input into the speaker decoder to obtain the speaker vector representation corresponding to the nth word;
[0153] The speaker corresponding to the nth character is determined based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the profile feature vectors of multiple speakers.
[0154] The first vector representation, the first n-1 words, and the weighted sum of the relevance coefficients for the feature vectors of multiple speakers are input into the speech recognition decoder to obtain the nth word.
[0155] The speech recognition output information is sent to the terminal device. The speech recognition output information includes the text sequence corresponding to each of the multiple speakers.
[0156] The above execution process can be referred to the relevant descriptions in the other embodiments mentioned above, and will not be repeated here.
[0157] For ease of understanding, an example is described with reference to Figure 9. A user can call the speech recognition service through the terminal device E1 shown in Figure 9 to perform speech recognition on a target speech signal and obtain the predicted text sequence corresponding to it. The service interface through which users call this service can take the form of a Software Development Kit (SDK) or an Application Programming Interface (API); Figure 9 illustrates the API case. In the cloud, as shown in the figure, assume that the speech recognition service is provided by a service cluster E2, which includes at least one computing node. Upon receiving the request, service cluster E2 executes the steps described in the previous embodiment to obtain the text sequence corresponding to the target speech signal, which includes the text sequence of each of the multiple speakers, and then returns it to the terminal device E1.
[0158] The following describes in detail one or more embodiments of a speech recognition device according to the present invention. Those skilled in the art will understand that these devices can all be configured using commercially available hardware components through the steps taught in this solution.
[0159] Figure 10 is a schematic diagram of the structure of a speech recognition device provided in an embodiment of the present invention. The speech recognition model includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder containing an attention layer. As shown in Figure 10, the device includes: an acquisition module 11, a first encoding module 12, a second encoding module 13, a first decoding module 14, and a second decoding module 15.
[0160] The acquisition module 11 is used to acquire the speech signals of multiple speakers and the portrait feature vectors of the multiple speakers, wherein the speech signals contain the speech of the multiple speakers.
[0161] The first encoding module 12 is used to obtain a first vector representation corresponding to the speech signal through the speech recognition encoder and to obtain a second vector representation corresponding to the speech signal through the speaker encoder. The first vector representation is used for speech recognition and the second vector representation is used for speaker recognition.
[0162] The second encoding module 13 is used to encode the first n-1 characters output by the speech recognition decoder through the text encoder to obtain the third vector representation corresponding to the n-1th character.
[0163] The first decoding module 14 is used to input the first vector representation, the second vector representation, and the third vector representation into the speaker decoder to obtain the speaker vector representation corresponding to the nth word; and to determine the speaker corresponding to the nth word based on the correlation coefficients between the speaker vector representation corresponding to the nth word and the portrait feature vectors of the plurality of speakers.
[0164] The second decoding module 15 is used to input the first vector representation, the first n-1 characters, and the weighted sum of the correlation coefficients on the portrait feature vectors of the multiple speakers into the speech recognition decoder to obtain the nth character.
[0165] Optionally, the first decoding module 14 is further configured to: for the portrait feature vector of any speaker among the plurality of speakers, determine a first similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker according to a set similarity algorithm; obtain the speaker vector representations corresponding to the first n characters respectively; concatenate the speaker vector representations corresponding to the first n characters with the portrait feature vector of any speaker respectively; input the multiple concatenated vector representations into a set scoring model including an attention module to obtain a second similarity between the speaker vector representations corresponding to the first n characters and the portrait feature vector of any speaker respectively; and determine a correlation coefficient between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker according to the first similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker and the second similarity between the speaker vector representation corresponding to the nth character and the portrait feature vector of any speaker.
[0166] Optionally, the second encoding module 13 is further configured to: obtain the weighted vector representation of the attention coefficients of the first n-1 characters obtained by the self-attention layer in the speech recognition decoder after performing attention calculations on the first n-1 characters respectively; input the weighted vector representation of the attention coefficients of the first n-1 characters respectively into the text encoder to obtain the third vector representation of the n-1th character output by the text encoder.
[0167] Optionally, the text encoder includes at least one cascaded encoder, each of which includes an attention layer and a feedforward neural network layer.
[0168] Optionally, the speech recognition decoder includes multiple cascaded decoders, each decoder including an attention layer and a feedforward neural network layer; based on this, the second decoding module 15 is further configured to: perform embedding encoding on the first n-1 characters respectively to obtain the embedding vector representation corresponding to each of the first n-1 characters; input the embedding vector representation corresponding to each of the first n-1 characters and the first vector representation into the first layer decoder of the speech recognition decoder to obtain the attention coefficient weighted vector representation of the (n-1)th character output by the attention layer in the first layer decoder; input the attention coefficient weighted vector representation of the (n-1)th character and the weighted sum into the feedforward neural network layer in the first layer decoder to obtain the vector representation of the (n-1)th character input to the second layer decoder; and determine the nth character based on the vector representation of the (n-1)th character output by the last layer decoder in the speech recognition decoder.
[0169] Optionally, the speaker decoder includes multiple cascaded decoders, each decoder including an attention layer and a feedforward neural network layer; based on this, the first decoding module 14 is further configured to: input the first vector representation, the second vector representation, and the third vector representation into the first layer decoder of the speaker decoder to obtain the speaker vector representation of the (n-1)th word output by the attention layer of the first layer decoder after attention coefficient weighting; input the speaker vector representation of the (n-1)th word output by attention coefficient weighting into the feedforward neural network layer of the first layer decoder to obtain the speaker vector representation of the (n-1)th word input to the second layer decoder; determine the speaker vector representation of the (n-1)th word output by the last layer decoder of the speaker decoder; and determine the speaker vector representation of the nth word based on the speaker vector representation of the (n-1)th word output by the last layer decoder and the speaker vector representation of the (n-1)th word output by the attention layer of the first layer decoder after attention coefficient weighting.
[0170] Optionally, the second decoding module 15 is further configured to: determine the complete text sequence output by the speech recognition decoder; encode the text sequence using the text encoder to obtain a fourth vector representation corresponding to each character in the text sequence; input the first vector representation, the second vector representation, and the fourth vector representation into the speaker decoder to obtain a speaker vector representation corresponding to each character; and re-determine the speaker corresponding to each character based on the correlation coefficients between the speaker vector representations corresponding to each character and the profile feature vectors of the plurality of speakers.
[0171] The device shown in Figure 10 can perform the steps of the speech recognition method in the foregoing embodiments. For the detailed execution process and technical effects, please refer to the description in the foregoing embodiments, which will not be repeated here.
[0172] This invention also provides an electronic device. As shown in Figure 11, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 stores executable code, which, when executed by the processor 21, enables the processor 21 to implement the speech recognition method as described in the preceding embodiments.
[0173] In addition, embodiments of the present invention provide a non-transitory machine-readable storage medium storing executable code, which, when executed by a processor of an electronic device, enables the processor to at least implement the speech recognition method provided in the foregoing embodiments.
[0174] The device embodiments described above are merely illustrative, and the units described as separate components may or may not be physically separate. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0175] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a computer product. The present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0176] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A speech recognition method, characterized in that a speech recognition model includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder containing an attention layer; the method includes: acquiring a speech signal of multiple speakers and profile feature vectors of the multiple speakers, wherein the speech signal contains the speech of the multiple speakers; obtaining, by the speech recognition encoder, a first vector representation corresponding to the speech signal, and obtaining, by the speaker encoder, a second vector representation corresponding to the speech signal, wherein the first vector representation is used for speech recognition and the second vector representation is used for speaker recognition; encoding, by the text encoder, the first n-1 characters output by the speech recognition decoder to obtain a third vector representation corresponding to the (n-1)th character; inputting the first vector representation, the second vector representation, and the third vector representation into the speaker decoder to obtain a speaker vector representation corresponding to the nth character; determining the speaker corresponding to the nth character based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the profile feature vectors of the multiple speakers; inputting the first vector representation, the first n-1 characters, and the weighted sum of the profile feature vectors of the multiple speakers, weighted by the correlation coefficients, into the speech recognition decoder to obtain the nth character;
determining the complete text sequence output by the speech recognition decoder; encoding the text sequence with the text encoder to obtain a fourth vector representation corresponding to each character in the text sequence; inputting the first vector representation, the second vector representation, and the fourth vector representation into the speaker decoder to obtain a speaker vector representation corresponding to each character; and re-determining the speaker corresponding to each character based on the correlation coefficients between the speaker vector representation corresponding to each character and the profile feature vectors of the multiple speakers.
2. The method according to claim 1, characterized in that the method further includes: for the profile feature vector of any speaker among the multiple speakers, determining, according to a set similarity algorithm, a first similarity between the speaker vector representation corresponding to the nth character and the profile feature vector of that speaker; obtaining the speaker vector representation corresponding to each of the first n characters; concatenating the speaker vector representation corresponding to each of the first n characters with the profile feature vector of that speaker; inputting the resulting concatenated vector representations into a scoring model containing an attention module to obtain a second similarity between the speaker vector representation corresponding to each of the first n characters and the profile feature vector of that speaker; and determining the correlation coefficient between the speaker vector representation corresponding to the nth character and the profile feature vector of that speaker based on the first similarity and the second similarity corresponding to the nth character.
3. The method according to claim 1, characterized in that the step of encoding, by the text encoder, the first n-1 characters output by the speech recognition decoder to obtain the third vector representation corresponding to the (n-1)th character includes: obtaining the attention-coefficient-weighted vector representations corresponding to the first n-1 characters, produced after the self-attention layer in the speech recognition decoder performs attention calculation on each of the first n-1 characters; and inputting these attention-coefficient-weighted vector representations into the text encoder to obtain the third vector representation corresponding to the (n-1)th character output by the text encoder.
4. The method according to claim 3, characterized in that the text encoder includes at least one cascaded encoder, each of which includes an attention layer and a feedforward neural network layer.
5. The method according to claim 1, characterized in that the speech recognition decoder includes multiple cascaded decoders, each of which includes an attention layer and a feedforward neural network layer; the step of inputting the first vector representation, the first n-1 characters, and the weighted sum of the profile feature vectors of the multiple speakers, weighted by the correlation coefficients, into the speech recognition decoder to obtain the nth character includes: performing embedding encoding on each of the first n-1 characters to obtain the embedding vector representation corresponding to each of the first n-1 characters; inputting the embedding vector representations corresponding to the first n-1 characters and the first vector representation into the first layer decoder of the speech recognition decoder to obtain the attention-coefficient-weighted vector representation corresponding to the (n-1)th character output by the attention layer in the first layer decoder; inputting the attention-coefficient-weighted vector representation corresponding to the (n-1)th character and the weighted sum into the feedforward neural network layer in the first layer decoder to obtain the vector representation corresponding to the (n-1)th character that is input to the second layer decoder; and determining the nth character based on the vector representation corresponding to the (n-1)th character output by the last layer decoder in the speech recognition decoder.
6. The method according to claim 1, characterized in that the speaker decoder includes multiple cascaded decoders, each of which includes an attention layer and a feedforward neural network layer; the step of inputting the first vector representation, the second vector representation, and the third vector representation into the speaker decoder to obtain the speaker vector representation corresponding to the nth character includes: inputting the first vector representation, the second vector representation, and the third vector representation into the first layer decoder of the speaker decoder to obtain the attention-coefficient-weighted speaker vector representation corresponding to the (n-1)th character output by the attention layer in the first layer decoder; inputting the attention-coefficient-weighted speaker vector representation corresponding to the (n-1)th character into the feedforward neural network layer in the first layer decoder to obtain the speaker vector representation corresponding to the (n-1)th character that is input to the second layer decoder; determining the speaker vector representation corresponding to the (n-1)th character output by the last layer decoder in the speaker decoder; and determining the speaker vector representation corresponding to the nth character based on the speaker vector representation corresponding to the (n-1)th character output by the last layer decoder and the attention-coefficient-weighted speaker vector representation corresponding to the (n-1)th character output by the attention layer in the first layer decoder.
7. A speech recognition method, characterized in that the method includes: receiving a request triggered by a terminal device by calling a speech recognition service, the request including a speech signal of multiple speakers and profile feature vectors of the multiple speakers, the speech signal containing the speech of the multiple speakers; performing the following steps using the computing resources corresponding to the model training service: obtaining a speech recognition model, which includes a speech recognition encoder, a speaker encoder, a speech recognition decoder, a speaker decoder, and a text encoder containing an attention layer; obtaining, by the speech recognition encoder, a first vector representation corresponding to the speech signal, and obtaining, by the speaker encoder, a second vector representation corresponding to the speech signal, wherein the first vector representation is used for speech recognition and the second vector representation is used for speaker recognition; encoding, by the text encoder, the first n-1 characters output by the speech recognition decoder to obtain a third vector representation corresponding to the (n-1)th character; inputting the first vector representation, the second vector representation, and the third vector representation into the speaker decoder to obtain a speaker vector representation corresponding to the nth character; determining the speaker corresponding to the nth character based on the correlation coefficients between the speaker vector representation corresponding to the nth character and the profile feature vectors of the multiple speakers; inputting the first vector representation, the first n-1 characters, and the weighted sum of the profile feature vectors of the multiple speakers, weighted by the correlation coefficients, into the speech recognition decoder to obtain the nth character;
determining the complete text sequence output by the speech recognition decoder; encoding the text sequence with the text encoder to obtain a fourth vector representation corresponding to each character in the text sequence; inputting the first vector representation, the second vector representation, and the fourth vector representation into the speaker decoder to obtain a speaker vector representation corresponding to each character; re-determining the speaker corresponding to each character based on the correlation coefficients between the speaker vector representation corresponding to each character and the profile feature vectors of the multiple speakers; and sending speech recognition output information to the terminal device, the speech recognition output information including the text sequence corresponding to each of the multiple speakers.
8. An electronic device, characterized in that it includes: a memory, a processor, and a communication interface; wherein the memory stores executable code, and when the executable code is executed by the processor, the processor performs the speech recognition method as described in any one of claims 1 to 6.
9. A non-transitory machine-readable storage medium, characterized in that the non-transitory machine-readable storage medium stores executable code that, when executed by a processor of an electronic device, causes the processor to perform the speech recognition method as described in any one of claims 1 to 6.