An end-to-end speech recognition method based on an improved Transformer structure

By combining an improved Transformer architecture with VAD endpoint detection and CAM++ speaker differentiation module, the problems of insufficient endpoint detection accuracy and confusion in multi-speaker scenarios in end-to-end speech recognition are solved, improving the processing efficiency and recognition accuracy of long speech sequences, and making it suitable for complex acoustic environments and multi-person dialogue scenarios.

CN122201257A · Pending · Publication Date: 2026-06-12 · NANJING XIANZHI INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: NANJING XIANZHI INFORMATION TECH CO LTD
Filing Date: 2026-04-14
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing end-to-end speech recognition technologies suffer from problems such as insufficient endpoint detection accuracy, confusion in multi-speaker scenarios, and low processing efficiency of long speech sequences in complex scenarios, especially in low signal-to-noise ratio and multi-person dialogue scenarios where recognition accuracy and robustness are insufficient.

Method used

An improved Transformer architecture is adopted, which combines the VAD endpoint detection model and the CAM++ speaker differentiation module. The endpoint detection, speaker differentiation and speech recognition are optimized through joint training. Locality-sensitive multi-head self-attention and relative position encoding are used to improve modeling efficiency, and text is generated through autoregressive decoding.

Benefits of technology

It significantly improves recognition accuracy and robustness in scenarios with multiple speakers, low signal-to-noise ratio, and long speech, meeting the accuracy and reliability requirements of real-time voice interaction.



Abstract

The application discloses an end-to-end speech recognition method based on an improved Transformer structure, comprising the following steps: S1, preprocessing an input speech signal; S2, inputting the result into a VAD endpoint detection model to obtain the feature sequence of the effective speech segment; S3, inputting that sequence into a CAM++ model, extracting speaker features with a convolutional neural network, and aggregating them by attention weighting; S4, fusing the feature sequences; S5, inputting the fused sequence into the improved Transformer model encoder, which applies relative position encoding, locality-sensitive multi-head self-attention, a feedforward neural network, and layer normalization, and outputs an encoded feature sequence; S6, inputting the text embedding of the previous time step and the speaker embedding vector into the improved Transformer model decoder, which interacts with the encoded feature sequence and generates a target text sequence; S7, outputting the recognized text result; and S8, jointly training the models. The application significantly improves recognition accuracy and processing efficiency in multi-speaker and long-speech scenarios.

Description

Technical Field

[0001] This invention relates to the fields of speech recognition and artificial intelligence, and in particular to an end-to-end speech recognition method based on an improved Transformer structure.

Background Technology

[0002] With the rapid development of artificial intelligence and voice interaction technologies, end-to-end speech recognition methods have gradually become the core direction of speech recognition research, widely applied in scenarios such as intelligent customer service, smart homes, in-vehicle voice assistants, and meeting transcription. End-to-end speech recognition directly maps speech signals to target text through deep neural networks, avoiding the complex design of multiple cascaded modules such as acoustic models, language models, and decoders in traditional methods. It has the advantages of unified modeling, automated feature learning, and simple deployment. In recent years, the Transformer architecture, with its powerful self-attention mechanism, has made breakthrough progress in the field of natural language processing and has been introduced into end-to-end speech recognition systems to improve long-distance dependency modeling capabilities and recognition accuracy. However, existing end-to-end speech recognition technologies still have many limitations in complex scenarios.

[0003] First, in the speech endpoint detection stage, existing methods typically use an independent VAD model to perform binary classification of the speech stream and determine the start and end frames of valid speech segments. Because the VAD model is independent of the speech recognition backbone network, joint training and parameter sharing are not achieved. Its detection results are easily affected by noisy environments, leading to false positives or false negatives. This results in invalid silence frames entering subsequent recognition modules, increasing computational overhead and reducing recognition accuracy. Especially in low signal-to-noise ratio scenarios, endpoint detection accuracy drops significantly, impacting subsequent feature extraction and recognition performance.

[0004] Secondly, in multi-speaker speech scenarios, traditional end-to-end recognition systems often employ speaker-independent modeling methods without introducing dedicated speaker feature extraction and differentiation mechanisms. The acoustic features of different speakers are prone to interference, leading to cross-identification errors or role misplacement in the recognition results. This significantly reduces the robustness and practicality of the system in complex scenarios such as multi-person dialogues, meeting recordings, and mixed dialects. Although some research has attempted to combine speaker recognition networks or external embedding vectors, these methods mostly employ post-processing fusion, lacking end-to-end joint optimization and failing to fully leverage the auxiliary role of speaker features in recognition and decoding.

[0005] Furthermore, the traditional Transformer architecture suffers from high computational complexity and insufficient contextual modeling when processing long speech sequences. Its multi-head self-attention mechanism's computational complexity increases quadratically with the input sequence length, making training and inference too costly in long speech or real-time recognition scenarios, failing to meet the demands of low-latency applications. Simultaneously, the Transformer based on absolute positional encoding suffers from positional information dilution in long sequences, resulting in insufficient ability to capture cross-temporal semantic dependencies, thus affecting the accuracy and contextual consistency of long speech recognition.

[0006] Therefore, how to provide an end-to-end speech recognition method based on an improved Transformer structure is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0007] One objective of this invention is to propose an end-to-end speech recognition method based on an improved Transformer structure. This invention improves the Transformer structure to enhance its processing efficiency and context modeling capabilities for long speech sequences. By superimposing a VAD endpoint detection model and a speaker differentiation model based on CAM++, the invention achieves synergistic optimization of endpoint detection, speaker differentiation, and speech recognition, thereby improving the accuracy and robustness of speech recognition in complex scenarios.

[0008] An end-to-end speech recognition method based on an improved Transformer architecture according to an embodiment of the present invention includes the following steps:
S1. Preprocess the input speech signal, extract speech feature parameters, and form a speech feature sequence;
S2. Input the speech feature sequence into the VAD endpoint detection model to determine the start and end frames of the speech, and obtain the feature sequence of the effective speech segment;
S3. Input the feature sequence of the effective speech segment into the speaker differentiation module based on the CAM++ model, extract speaker features using a convolutional neural network, aggregate them by attention weighting, and output the speaker embedding vector;
S4. Fuse the speaker embedding vector with the feature sequence of the effective speech segment to obtain a fused speech feature sequence containing speaker information;
S5. Input the fused speech feature sequence into the improved Transformer model encoder, apply relative position encoding, locality-sensitive multi-head self-attention, a feedforward neural network, and layer normalization, and output the encoded feature sequence;
S6. Input the text embedding from the previous time step and the speaker embedding vector into the improved Transformer model decoder, interact with the encoded feature sequence, and generate the target text sequence autoregressively;
S7. Perform beam search rescoring, punctuation restoration, and duplicate word deletion on the target text sequence, and output the recognized text result;
S8. Jointly train the VAD endpoint detection model, the CAM++ model, and the improved Transformer model, optimizing the model parameters on the weighted total of the VAD loss, the speaker differentiation loss, and the speech recognition loss.

[0009] Optionally, step S1 specifically includes:
S11. Divide the input speech signal into frames according to a frame length $N$ and a frame shift $M$;
S12. Apply a window to each frame of the speech signal. The window function is defined as $w(t)$ = [missing value], where $w(t)$ is the weight of the window function at the $t$-th point, $t$ is the index of a discrete sampling point within the frame, $0 \le t \le N-1$, and $N$ is the frame length. For the $i$-th frame of the speech signal, the windowed frame signal is $x_i(t) = s(iM + t)\, w(t)$, where $x_i(t)$ is the windowed speech signal of the $i$-th frame, $s(\cdot)$ is the original speech signal, $i$ is the frame index, $M$ is the frame shift, and $w(t)$ is the weight of the window function at the $t$-th point;
S13. Perform a Fourier transform on the windowed frame signal to obtain the frame-level spectral amplitude;
S14. Input the frame-level spectral amplitude into a Mel filter bank of 40 filters covering the frequency range 20 Hz to 4 kHz, and calculate the Mel filter bank energies;
S15. Take the logarithm of the Mel filter bank energies and perform a discrete cosine transform to extract the 13-dimensional Mel frequency cepstral coefficients and their first-order differences, forming a feature vector containing static and dynamic features, and arrange the feature vectors in chronological order to form the speech feature sequence $X$.

[0010] Optionally, step S2 specifically includes:
S21. Divide the speech feature sequence into input subsequences using a 5-frame sliding window;
S22. Input the subsequences into the VAD endpoint detection model based on a bidirectional LSTM, which outputs the speech presence probability $p_t$ of each frame;
S23. Determine the start and end frames of the speech from the per-frame speech probabilities, and obtain the effective speech segment feature sequence $X_v$;
S24. Define the label sequence of the effective speech segment as $Y = \{y_1, y_2, \dots, y_T\}$ with $y_t \in \{0, 1\}$, where $Y$ is the label sequence of the effective speech segment, $y_t$ is the label of frame $t$, $y_t = 1$ indicates that frame $t$ is a speech frame, $y_t = 0$ indicates that frame $t$ is a silent frame, and $T$ is the total number of frames.

[0011] Optionally, step S3 specifically includes:
S31. Input the effective speech segment feature sequence into the speaker differentiation module based on the CAM++ structure;
S32. Perform convolution operations on the frame features using a multi-layer convolutional neural network comprising 3 convolutional layers with a kernel size of [missing value], a stride of 2, and [missing value] output channels, to obtain the frame-level acoustic feature maps $f_t = \mathrm{CNN}(x_t)$, where $f_t$ is the frame-level acoustic feature vector obtained for frame $t$, $\mathrm{CNN}(\cdot)$ denotes the convolution operations, and $x_t$ is the windowed audio signal of frame $t$;
S33. Calculate the global average pooling feature $g = \frac{1}{T_v} \sum_{t=1}^{T_v} f_t$, where $g$ is the global average pooled feature vector obtained by aggregating all frame-level features, $T_v$ is the total number of frames in the valid speech segments, and $f_t$ is the frame-level acoustic feature vector obtained for frame $t$. Attention weights are generated using two fully connected layers: $\alpha_t = \dfrac{\exp\left(a(W_g g,\, W_f f_t)\right)}{\sum_{j=1}^{T_v} \exp\left(a(W_g g,\, W_f f_j)\right)}$, where $\alpha_t$ is the attention coefficient of frame $t$, $\exp(\cdot)$ is the natural exponential function, $W_g g$ is the fully connected layer mapping of $g$, $g$ is the global average pooling feature vector, $W_f f_t$ is the fully connected layer mapping of the frame features, $a(\cdot, \cdot)$ denotes the scalar score combining the two mappings, $f_t$ is the frame-level acoustic feature vector obtained for frame $t$, $T_v$ is the total number of frames in the valid speech segments, and $f_j$ is the frame-level acoustic feature of the $j$-th valid frame;
S34. Aggregate the frame-level features by weighting to obtain the speaker embedding vector $e_{spk} = \sum_{t=1}^{T_v} \alpha_t f_t$, where $e_{spk}$ is the speaker embedding vector, $T_v$ is the total number of frames in the valid speech segments, $\alpha_t$ is the attention coefficient of frame $t$, and $f_t$ is the frame-level acoustic feature vector obtained for frame $t$.

[0012] Optionally, step S4 specifically includes:
S41. Fuse the speaker embedding vector with the feature sequence of the effective speech segment to obtain the fused frame features $c_t = \mathrm{Concat}(v_t, e_{spk})$, where $c_t$ is the fused frame feature, $\mathrm{Concat}(\cdot)$ denotes the concatenation of feature vectors, $v_t$ is the feature of frame $t$ in the effective speech segment feature sequence, and $e_{spk}$ is the speaker embedding vector;
S42. Perform a linear transformation and normalization on the fused frame features to obtain the final fused speech feature sequence containing speaker information: $u_t = \mathrm{Norm}(W c_t)$, where $c_t \in \mathbb{R}^{d_x + d_e}$, $d_e$ is the embedding vector dimension, $d_x$ is the original feature dimension, and $d_x + d_e$ is the total dimension after merging.

[0013] Optionally, step S5 specifically includes:
S51. Input the fused feature sequence into a multi-layer stacked improved Transformer encoder, using a three-layer hierarchical Transformer encoder in which each layer includes a locality-sensitive multi-head self-attention sub-layer and a feedforward neural network sub-layer;
S52. Add relative position encoding to the features of each frame. The encoded input is expressed as:

[0014] $\mathrm{EncInput}_t = u_t + P_t$, with $P_t \in \mathbb{R}^{d}$, where $\mathrm{EncInput}_t$ is the encoder input vector of the $t$-th frame, $u_t$ is the fused feature vector of frame $t$, $d$ is the encoding dimension, and $P_t$ is the position vector of the $t$-th frame;
S53. In the multi-head self-attention calculation of each layer, construct the query matrix $Q$, the key matrix $K$, and the value matrix $V$;
S54. For each frame, limit the attention calculation to a local window of $k$ frames to the left and right of that frame, and calculate the locality-sensitive multi-head attention: $\mathrm{Attention}(q_i, K_{[i-k,\,i+k]}, V_{[i-k,\,i+k]}) = \mathrm{Softmax}\!\left(\dfrac{q_i K_{[i-k,\,i+k]}^{\top}}{\sqrt{d_k}}\right) V_{[i-k,\,i+k]}$, where $q_i$ is the query vector of the $i$-th frame, $K_{[i-k,\,i+k]}^{\top}$ is the transpose of the matrix of key vectors from frame $i-k$ to frame $i+k$ within the local window, $d_k$ is the dimension of the key vector, $V_{[i-k,\,i+k]}$ is the set of value vectors from frame $i-k$ to frame $i+k$ within the window, regions outside the window are set to zero using a mask, and $\mathrm{Softmax}(\cdot)$ denotes the normalization operation;
S55. Concatenate the outputs of all attention heads to obtain the multi-head attention output, which after a linear transformation, a residual connection, and layer normalization forms the output of the multi-head attention sub-layer;
S56. Input the output of the multi-head attention sub-layer into the feedforward neural network, which includes two fully connected layers and a nonlinear activation function. The output is processed by a residual connection and layer normalization to obtain the encoding result of the current layer;
S57. Use the encoding result as the input of the next layer, and repeat the operation until the last layer to obtain the encoded feature sequence $H$.
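The local-window attention of step S54 can be illustrated with a minimal single-head sketch in plain Python. It is not the patented implementation: the function name `local_attention` is hypothetical, the relative position term and the multi-head split are omitted, and the window is simply clipped at the sequence boundaries rather than masked.

```python
import math

def local_attention(Q, K, V, k):
    """Single-head attention where frame i only attends to frames i-k .. i+k.

    Q, K, V are lists of equal-length rows (one row per frame).
    Assumption: positions outside the window are skipped, which has the same
    effect as masking them out before the softmax.
    """
    T = len(Q)
    d_k = len(K[0])
    out = []
    for i in range(T):
        lo, hi = max(0, i - k), min(T - 1, i + k)
        window = range(lo, hi + 1)
        # Scaled dot-product scores q_i . k_j / sqrt(d_k) inside the window.
        scores = [sum(q * kk for q, kk in zip(Q[i], K[j])) / math.sqrt(d_k)
                  for j in window]
        # Numerically stable softmax over the window.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the value vectors inside the window.
        out.append([sum(w * V[j][c] for w, j in zip(weights, window))
                    for c in range(len(V[0]))])
    return out
```

Each frame touches at most 2k + 1 keys, so the cost of this sketch grows as O(T·k) in the number of frames T instead of O(T²) for full self-attention.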

[0015] Optionally, step S6 specifically includes:
S61. Embed the text: $E_i \in \mathbb{R}^{d}$, where $E_i$ is the embedding vector of the $i$-th generated word, $\mathbb{R}$ is the real number field, and $d$ is the word embedding dimension. The word embedding, the speaker embedding, and the decoder position encoding vector are fused to form the decoder input vector:

[0016] $\mathrm{DecInput}_i = \mathrm{Fuse}(E_i, e_{spk}, PE_i)$, where $\mathrm{DecInput}_i$ is the input vector of the decoder at time step $i$, $E_i$ is the embedding vector of the $i$-th generated word, $e_{spk}$ is the speaker embedding vector, $PE_i$ is the decoder position encoding vector, $i$ is the index of the current decoding step, $d$ is the encoding dimension, and $\mathrm{Fuse}(\cdot)$ denotes the fusion of the three vectors;
S62. Perform contextual interaction between the input vector and the encoded feature sequence to calculate the cross-attention: $\mathrm{CrossAttention}(Q, H) = \mathrm{Softmax}\!\left(\dfrac{Q H^{\top}}{\sqrt{d_k}}\right) H$, where $\mathrm{CrossAttention}(Q, H)$ is the result of the cross-attention mechanism, $Q$ is the query matrix, $H \in \mathbb{R}^{T \times d}$ is the encoded feature matrix, $\mathrm{Softmax}(\cdot)$ denotes the normalization operation, $H$ is the encoder output used as the key matrix, $T$ is the total number of frames in the input sequence, and $d_k$ is the dimension of the key vector;
S63. Input the cross-attention result into the feedforward neural network to obtain the prediction output vector of the current time step;
S64. Calculate the probability distribution of the next target word using a linear transformation and the Softmax function: $P(\hat{y}_{i+1} \mid \hat{y}_1, \dots, \hat{y}_i) = \mathrm{Softmax}\left(W_o \cdot \mathrm{FFN}(\mathrm{CrossAttention\_out}) + b_o\right)$, where $P(\hat{y}_{i+1} \mid \hat{y}_1, \dots, \hat{y}_i)$ is the probability distribution of the next word $\hat{y}_{i+1}$ given the first $i$ words, $\mathrm{Softmax}(\cdot)$ denotes the normalization operation, $W_o$ is the output weight matrix, $\mathrm{FFN}(\mathrm{CrossAttention\_out})$ is the output of the feedforward neural network, and $b_o$ is the output bias vector;
S65. Based on the probability distribution, retain candidates as the results generated in the current step, and use the embedding vector of those results as the input of the next step. Repeat recursively until a sequence end marker is generated, yielding the complete target text sequence $\hat{Y}$.
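The recursive generation in steps S64 and S65 can be sketched as a greedy decoding loop. This is only an illustration under stated assumptions: `step_fn` is a hypothetical callable standing in for the decoder (cross-attention, feedforward network, and output projection), and keeping only the single most probable word per step is a simplification of the candidate retention described above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def greedy_decode(step_fn, sos_id, eos_id, max_len=50):
    """Generate token ids one at a time until the end marker appears.

    step_fn(prefix) is a hypothetical stand-in for the decoder: it takes the
    ids generated so far and returns the logits for the next token.
    """
    tokens = [sos_id]
    while len(tokens) < max_len:
        probs = softmax(step_fn(tokens))
        # Keep the most probable next token (greedy choice).
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]  # drop the start marker
```

The `max_len` guard is an added safety limit so that the loop terminates even if the end marker is never predicted.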

[0017] Optionally, step S7 specifically includes:
S71. During autoregressive decoding, perform a beam search on the candidate set of the target text sequence: set the beam width B and retain the top B candidate sequences ranked by sequence probability;
S72. Calculate a score for each candidate sequence, rescore the candidates, and determine the output sequence according to the rescoring result;
S73. Perform punctuation restoration on the output sequence, inserting punctuation marks into the word sequence according to a preset punctuation restoration model;
S74. Perform duplicate word deletion on the text after punctuation restoration, removing consecutively repeated words and meaningless filler words, and output the final recognized text result.
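Steps S71 and S74 can be illustrated with a small sketch. The names `beam_search`, `step_log_probs`, and `remove_repeats` are hypothetical, candidates are ranked by cumulative log-probability without length normalization, and the rescoring of S72 and the punctuation model of S73 are not modeled here.

```python
def beam_search(step_log_probs, beam_width, eos_id, max_len=20):
    """Keep the `beam_width` most probable partial sequences at every step.

    step_log_probs(seq) is a hypothetical stand-in for the decoder: it returns
    one log-probability per vocabulary entry for the token following `seq`.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(step_log_probs(seq)):
                candidates.append((seq + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos_id:
                finished.append((seq, score))  # hypothesis is complete
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished beams if needed
    return max(finished, key=lambda c: c[1])[0]

def remove_repeats(words, fillers=()):
    """Drop filler words and collapse consecutively repeated words."""
    out = []
    for w in words:
        if w in fillers:
            continue
        if out and out[-1] == w:
            continue
        out.append(w)
    return out
```

For example, `remove_repeats(["I", "I", "um", "think", "think", "so"], fillers={"um"})` keeps one copy of each repeated word and drops the filler.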

[0018] Optionally, step S8 specifically includes:
S81. Calculate the loss function of the VAD model as the binary cross-entropy: $L_{VAD} = -\dfrac{1}{T} \sum_{t=1}^{T} \left[ y_t \log p_t + (1 - y_t) \log (1 - p_t) \right]$, where $L_{VAD}$ is the loss function of the VAD model, $\frac{1}{T}$ normalizes over the total number of time steps $T$, $y_t$ is the frame label at time $t$, $p_t$ is the speech probability of the frame at time $t$, $y_t = 1$ denotes a speech frame, and $y_t = 0$ denotes a silent frame;
S82. Combine the endpoint detection loss $L_{VAD}$ with the speech recognition loss $L_{ASR}$ and the speaker differentiation loss $L_{SPK}$ by weighted summation with the weighting factors $\lambda_1, \lambda_2, \lambda_3$ to obtain the total loss for joint training: $L_{total} = \lambda_1 L_{ASR} + \lambda_2 L_{VAD} + \lambda_3 L_{SPK}$, where $L_{total}$ is the total loss of joint training and $\lambda_1, \lambda_2, \lambda_3$ are non-negative real numbers that weight the speech recognition loss $L_{ASR}$, the endpoint detection loss $L_{VAD}$, and the speaker differentiation loss $L_{SPK}$ respectively;
S83. Based on the total loss $L_{total}$, update the parameters of the VAD endpoint detection model, the CAM++ speaker differentiation module, and the improved Transformer model through backpropagation;
S84. Determine whether training has converged according to a preset convergence criterion. If it has converged, output and save the trained model; otherwise, continue iterating.
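The loss computation of steps S81 and S82 can be sketched numerically as follows. The function names and the default weights are assumptions chosen for illustration; the text only requires the weights to be non-negative real numbers, and the speech recognition and speaker losses are taken here as already computed scalars.

```python
import math

def vad_bce_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy averaged over frames.

    labels[t] is 1 for a speech frame and 0 for a silent frame;
    probs[t] is the predicted speech probability of frame t.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

def joint_loss(l_asr, l_vad, l_spk, weights=(1.0, 0.3, 0.3)):
    """Weighted sum of recognition, endpoint detection, and speaker losses.

    The default weights are illustrative assumptions, not values from the text.
    """
    if any(w < 0 for w in weights):
        raise ValueError("loss weights must be non-negative")
    w_asr, w_vad, w_spk = weights
    return w_asr * l_asr + w_vad * l_vad + w_spk * l_spk
```

In a training framework, the gradient of this single scalar would be backpropagated through all three models, which is what couples their parameter updates.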

[0019] The beneficial effects of this invention are: This invention addresses the shortcomings of existing end-to-end speech recognition methods, such as insufficient endpoint detection accuracy, confusion in multi-speaker scenarios, and low efficiency in processing long sequences, by deeply integrating a bidirectional LSTM-based VAD endpoint detection model, a speaker differentiation module based on the CAM++ architecture, and an improved Transformer encoder. It proposes a joint training multi-task learning framework to achieve coordinated optimization of endpoint detection, speaker differentiation, and speech recognition. In the endpoint detection stage, frame-level sliding window input and bidirectional LSTM network modeling of temporal context features significantly improve the accuracy of start and end frame localization in noisy environments and reduce the transmission of invalid silence frames. In the speaker differentiation stage, a multi-layer convolutional neural network is used to extract frame-level acoustic features, and speaker embedding vectors are generated through attention weighted aggregation. These vectors are then fused with effective speech segment features, enabling the current speaker information to be perceived during decoding, significantly reducing cross-identification errors in multi-speaker scenarios. In the feature modeling stage, the improved Transformer encoder introduces relative position encoding and a locality-sensitive multi-head self-attention mechanism, confining attention calculations within a local window. This reduces computational complexity from quadratic to linear or log-linear levels, improving the modeling efficiency and context capture accuracy of long speech sequences. During joint training, the weighted sum of VAD loss, speaker discrimination loss, and speech recognition loss is used as the optimization objective. 
Synchronous convergence of these three losses is achieved through parameter sharing and backpropagation, enabling endpoint detection and speaker modeling to proactively adapt to the requirements of the recognition task. The final output recognized text undergoes beam search re-scoring and post-processing to ensure coherence, proper punctuation, and removal of duplicate words. This invention demonstrates significant advantages in multi-person dialogue, low signal-to-noise ratio, and long speech scenarios, effectively meeting the accuracy and robustness requirements of real-time voice interaction.

Brief Description of the Drawings

[0020] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings: Figure 1 is a schematic diagram of the overall process of the end-to-end speech recognition method based on an improved Transformer structure proposed in this invention; Figure 2 is a schematic diagram of the overall structure of the method; Figure 3 is a flowchart of the multi-task joint training process in the method.

Detailed Description

[0021] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.

[0022] Referring to Figures 1-3, an end-to-end speech recognition method based on an improved Transformer architecture includes the following steps:
S1. Preprocess the input speech signal, extract speech feature parameters, and form a speech feature sequence;
S2. Input the speech feature sequence into the VAD endpoint detection model to determine the start and end frames of the speech, and obtain the feature sequence of the effective speech segment;
S3. Input the feature sequence of the effective speech segment into the speaker differentiation module based on the CAM++ model, extract speaker features using a convolutional neural network, aggregate them by attention weighting, and output the speaker embedding vector;
S4. Fuse the speaker embedding vector with the feature sequence of the effective speech segment to obtain a fused speech feature sequence containing speaker information;
S5. Input the fused speech feature sequence into the improved Transformer model encoder, apply relative position encoding, locality-sensitive multi-head self-attention, a feedforward neural network, and layer normalization, and output the encoded feature sequence;
S6. Input the text embedding from the previous time step and the speaker embedding vector into the improved Transformer model decoder, interact with the encoded feature sequence, and generate the target text sequence autoregressively;
S7. Perform beam search rescoring, punctuation restoration, and duplicate word deletion on the target text sequence, and output the recognized text result;
S8. Jointly train the VAD endpoint detection model, the CAM++ model, and the improved Transformer model, optimizing the model parameters on the weighted total of the VAD loss, the speaker differentiation loss, and the speech recognition loss.

[0023] This invention proposes an end-to-end speech recognition method based on an improved Transformer architecture. It accurately extracts effective speech segments through bidirectional LSTM-based VAD endpoint detection, introduces a CAM++ model to extract speaker embeddings and fuse them with speech features, and uses an improved Transformer encoder that combines locality-sensitive multi-head self-attention and relative position encoding to improve the efficiency of long sequence modeling. The decoder generates text using an autoregressive approach, and the results are re-scored and output after beam search. Through multi-task joint training, the parameters of the VAD, speaker discrimination, and recognition models are simultaneously optimized, significantly improving recognition accuracy, endpoint detection precision, and long speech processing efficiency in multi-speaker scenarios.

[0024] In this embodiment, step S1 specifically includes:
S11. Divide the input speech signal into frames according to a frame length $N$ and a frame shift $M$;
S12. Apply a window to each frame of the speech signal. The window function is defined as $w(t)$ = [missing value], where $w(t)$ is the weight of the window function at the $t$-th point, $t$ is the index of a discrete sampling point within the frame, $0 \le t \le N-1$, and $N$ is the frame length. For the $i$-th frame of the speech signal, the windowed frame signal is $x_i(t) = s(iM + t)\, w(t)$, where $x_i(t)$ is the windowed speech signal of the $i$-th frame, $s(\cdot)$ is the original speech signal, $i$ is the frame index, $M$ is the frame shift, and $w(t)$ is the weight of the window function at the $t$-th point;
S13. Perform a Fourier transform on the windowed frame signal to obtain the frame-level spectral amplitude;
S14. Input the frame-level spectral amplitude into a Mel filter bank of 40 filters covering the frequency range 20 Hz to 4 kHz, and calculate the Mel filter bank energies;
S15. Take the logarithm of the Mel filter bank energies and perform a discrete cosine transform to extract the 13-dimensional Mel frequency cepstral coefficients and their first-order differences, forming a feature vector containing static and dynamic features, and arrange the feature vectors in chronological order to form the speech feature sequence $X$.

[0025] This invention first divides the input speech signal into frames according to a set frame length and frame shift, and applies a windowing function to each frame to reduce edge effects, forming a windowed frame signal. Then, a Fast Fourier Transform is performed on each frame to obtain the frame-level spectral amplitude, and a 40-dimensional Mel filter bank energy is calculated using a Mel filter bank, covering a frequency range of 20Hz to 4kHz. Next, the logarithm of the filter bank energy is taken and a Discrete Cosine Transform is performed to extract 13-dimensional Mel frequency cepstral coefficients and their first-order difference features, forming a 26-dimensional feature vector containing both static and dynamic information. Arranging these features in chronological order yields a complete speech feature sequence. This preprocessing step ensures that the input features for subsequent endpoint detection, speaker feature extraction, and the Transformer encoder have high time-frequency resolution and robustness, providing stable and reliable input data for the entire recognition process.
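The framing, windowing, and first-order difference steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, a Hamming window is assumed because the window formula is not preserved in the text, and the first-order difference is taken as a simple backward difference between consecutive frames. The Fourier transform, Mel filter bank, and discrete cosine transform are left out.

```python
import math

def frame_and_window(signal, frame_len, frame_shift):
    """Split `signal` into overlapping frames and window each frame.

    Assumption: Hamming window weights; frame_len must be greater than 1.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    if len(signal) < frame_len:
        return []
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = []
    for i in range(n_frames):
        start = i * frame_shift
        # Windowed frame: sample at offset t times the window weight at t.
        frames.append([signal[start + t] * window[t]
                       for t in range(frame_len)])
    return frames

def add_deltas(static):
    """Append first-order differences to each static feature vector.

    Assumption: delta at frame t is static[t] - static[t-1], and zero at t=0.
    """
    out = []
    for t, vec in enumerate(static):
        prev = static[t - 1] if t > 0 else vec
        delta = [a - b for a, b in zip(vec, prev)]
        out.append(list(vec) + delta)
    return out
```

With 13 static coefficients per frame, `add_deltas` yields the 26-dimensional vectors mentioned in the paragraph above.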

[0026] In this embodiment, step S2 specifically includes:
S21. Divide the speech feature sequence into input subsequences using a 5-frame sliding window;
S22. Input the subsequences into the VAD endpoint detection model based on a bidirectional LSTM, which outputs the speech presence probability $p_t$ of each frame;
S23. Determine the start and end frames of the speech from the per-frame speech probabilities, and obtain the effective speech segment feature sequence $X_v$;
S24. Define the label sequence of the effective speech segment as $Y = \{y_1, y_2, \dots, y_T\}$ with $y_t \in \{0, 1\}$, where $Y$ is the label sequence of the effective speech segment, $y_t$ is the label of frame $t$, $y_t = 1$ indicates that frame $t$ is a speech frame, $y_t = 0$ indicates that frame $t$ is a silent frame, and $T$ is the total number of frames.

[0027] This invention divides the speech feature sequence into a sliding window of a set length to form a continuous input subsequence. Then, the input subsequence is fed into a bidirectional LSTM endpoint detection model, which calculates the speech presence probability for each frame using forward and backward time-series features. Based on the output probability, the start and end frames of the speech are determined, resulting in a valid speech segment feature sequence after removing silence frames. A corresponding frame-level label sequence is then generated as a training supervision signal. By modeling the bidirectional dependencies of the context using bidirectional LSTM, this method can accurately locate speech boundaries in noisy environments, reduce the transmission of invalid silence frames to subsequent modules, reduce computational redundancy, and improve overall recognition efficiency and endpoint detection accuracy.
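How the per-frame probabilities turn into start and end frames can be shown with a short sketch. The decision rule is an assumption: the text does not state how probabilities are converted to labels, so a fixed threshold of 0.5 is used here, and the function names are hypothetical. The bidirectional LSTM itself is not modeled; its output is taken as a list of probabilities.

```python
def find_endpoints(speech_probs, threshold=0.5):
    """Return (start_frame, end_frame, labels) from per-frame probabilities.

    Assumption: a frame counts as speech when its probability is at least
    `threshold`. Returns (None, None, labels) when no speech is found.
    """
    labels = [1 if p >= threshold else 0 for p in speech_probs]
    speech_idx = [t for t, y in enumerate(labels) if y == 1]
    if not speech_idx:
        return None, None, labels
    return speech_idx[0], speech_idx[-1], labels

def keep_speech_frames(features, labels):
    """Drop silent frames so only the effective speech segment remains."""
    return [f for f, y in zip(features, labels) if y == 1]
```

A practical system would likely smooth the label sequence before locating the boundaries, for instance by requiring several consecutive speech frames, but that refinement is outside what the text specifies.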

[0028] In this embodiment, step S3 specifically includes:
S31. Input the effective speech segment feature sequence into the speaker differentiation module based on the CAM++ structure;
S32. Perform convolution operations on the frame features using a multi-layer convolutional neural network comprising 3 convolutional layers with a kernel size of [missing value], a stride of 2, and [missing value] output channels, to obtain the frame-level acoustic feature maps $f_t = \mathrm{CNN}(x_t)$, where $f_t$ is the frame-level acoustic feature vector obtained for frame $t$, $\mathrm{CNN}(\cdot)$ denotes the convolution operations, and $x_t$ is the windowed audio signal of frame $t$;
S33. Calculate the global average pooling feature $g = \frac{1}{T_v} \sum_{t=1}^{T_v} f_t$, where $g$ is the global average pooled feature vector obtained by aggregating all frame-level features, $T_v$ is the total number of frames in the valid speech segments, and $f_t$ is the frame-level acoustic feature vector obtained for frame $t$. Attention weights are generated using two fully connected layers: $\alpha_t = \dfrac{\exp\left(a(W_g g,\, W_f f_t)\right)}{\sum_{j=1}^{T_v} \exp\left(a(W_g g,\, W_f f_j)\right)}$, where $\alpha_t$ is the attention coefficient of frame $t$, $\exp(\cdot)$ is the natural exponential function, $W_g g$ is the fully connected layer mapping of $g$, $g$ is the global average pooling feature vector, $W_f f_t$ is the fully connected layer mapping of the frame features, $a(\cdot, \cdot)$ denotes the scalar score combining the two mappings, $f_t$ is the frame-level acoustic feature vector obtained for frame $t$, $T_v$ is the total number of frames in the valid speech segments, and $f_j$ is the frame-level acoustic feature of the $j$-th valid frame;
S34. Aggregate the frame-level features by weighting to obtain the speaker embedding vector $e_{spk} = \sum_{t=1}^{T_v} \alpha_t f_t$, where $e_{spk}$ is the speaker embedding vector, $T_v$ is the total number of frames in the valid speech segments, $\alpha_t$ is the attention coefficient of frame $t$, and $f_t$ is the frame-level acoustic feature vector obtained for frame $t$.

[0029] This invention inputs the effective speech segment feature sequence into the convolutional neural network layers of the CAM++ model, sequentially performing multi-layer convolution operations and downsampling to extract frame-level acoustic feature maps that retain the spectral and spatial information carrying speaker characteristics. Subsequently, the global average pooling feature is calculated, and frame-level attention weights are generated through two fully connected layers. These weights are then used to aggregate the frame-level acoustic features by weighted summation, yielding an embedding vector that represents the current speaker's characteristics. This embedding vector is then concatenated and fused with the effective speech segment feature sequence to form a fused feature sequence containing speaker information, which is used for subsequent contextual modeling by the Transformer encoder. Through this step, the system can extract and use speaker features within an end-to-end framework, achieving discriminative modeling in multi-speaker scenarios, significantly reducing the recognition confusion rate during role switching, and improving the robustness and accuracy of recognition in multi-person dialogue and conference scenarios.
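A minimal NumPy sketch of the pooling and attention-weighted aggregation described above is given below. It is an illustration under assumptions rather than the CAM++ implementation: the convolutional front end is skipped and `h` is taken as given, the two fully connected layers are reduced to the hypothetical weight matrices `w_g` and `w_h` without biases, and the per-frame score is assumed to be the inner product of the two mapped vectors.

```python
import numpy as np

def speaker_embedding(h, w_g, w_h):
    """h: (T, C) frame-level features; w_g, w_h: (C, A) weights of the two
    fully connected layers (biases omitted). Returns (s, alpha)."""
    g = h.mean(axis=0)                  # global average pooling, shape (C,)
    scores = (h @ w_h) @ (g @ w_g)      # one scalar score per frame, shape (T,)
    scores = scores - scores.max()      # stabilise the exponentials
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention coefficients
    s = alpha @ h                       # weighted aggregation, shape (C,)
    return s, alpha

rng = np.random.default_rng(0)
h = rng.normal(size=(50, 8))            # made-up frame-level features
s, alpha = speaker_embedding(h, rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
```

With all-zero weights every frame receives the same coefficient and the embedding reduces to the plain average of the frame features, which is a convenient sanity check.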

[0030] In this embodiment, step S4 specifically includes: S41. Fuse the speaker embedding vector with the feature sequence of the effective speech segment to obtain the fused frame features: $z_t = \mathrm{Concat}(x'_t, s)$; where $z_t$ indicates the fused frame feature, $\mathrm{Concat}(\cdot)$ represents the concatenation of feature vectors, $x'_t$ represents the frame-$t$ feature of the effective speech segment feature sequence, and $s$ represents the speaker embedding vector; S42. Perform linear transformation and normalization on the fused frame features to obtain the final $d$-dimensional fused speech feature sequence containing speaker information: $Z = \{\tilde{z}_1, \tilde{z}_2, \dots, \tilde{z}_T\}$; where $d = d_x + d_s$, $d_s$ is the embedding vector dimension, $d_x$ is the original feature dimension, and $d$ represents the total dimension after merging.

[0031] This invention concatenates or weights the speaker embedding vector obtained from the CAM++ module with the frame-level features of the effective speech segments, aligning them frame by frame to form a fused frame feature containing speaker information. Subsequently, linear transformation and normalization operations are performed on the fused frame features, mapping the extended high-dimensional features after concatenation to the target embedding space, ensuring the stability of the feature distribution and the consistency of the numerical range, thus obtaining the final fused speech feature sequence. This fused sequence preserves the time-frequency information and individualized speaker features of the speech, enabling subsequent features input to the improved Transformer encoder to not only contain acoustic content but also explicitly incorporate speaker context, which helps improve the discriminative power of long sequence modeling and the recognition accuracy in multi-speaker scenarios.
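The concatenation, linear transformation, and normalization described above can be sketched as follows. The sketch assumes per-frame layer normalization (the text only says "normalization"), random projection parameters `w` and `b`, and illustrative dimensions (26-dimensional frame features and a 6-dimensional speaker embedding); none of these values are taken from the source.

```python
import numpy as np

def fuse(x, s, w, b, eps=1e-5):
    """x: (T, d_x) valid-segment features; s: (d_s,) speaker embedding;
    w: (d_x + d_s, d_out) and b: (d_out,) are the projection parameters."""
    s_tiled = np.broadcast_to(s, (x.shape[0], s.shape[0]))  # same s on every frame
    z = np.concatenate([x, s_tiled], axis=1)                # (T, d_x + d_s)
    y = z @ w + b                                           # linear transformation
    mean = y.mean(axis=1, keepdims=True)
    var = y.var(axis=1, keepdims=True)
    return (y - mean) / np.sqrt(var + eps)                  # per-frame layer norm

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 26))
s = rng.normal(size=6)
fused = fuse(x, s, rng.normal(size=(32, 32)), np.zeros(32))
```

Because the same speaker vector is appended to every frame, the fused sequence keeps the frame-by-frame alignment of the original features.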

[0032] In this embodiment, step S5 specifically includes: S51. Input the fused feature sequence into a multi-layer stacked improved Transformer encoder, using a three-layer hierarchical Transformer encoder, each layer including a locality-sensitive multi-head self-attention sub-layer and a feedforward neural network sub-layer; S52. Add relative position encoding to the features of each frame. The encoded input is represented as follows: $\mathrm{EncInput}_t = \tilde{z}_t + PE_t$

[0033] $PE_t \in \mathbb{R}^{d}$; where $\mathrm{EncInput}_t$ is the encoder input vector for the $t$-th frame, $\tilde{z}_t$ is the fused feature vector of frame $t$, $d$ is the encoding dimension, and $PE_t$ is the position vector of the $t$-th frame; S53. In the multi-head self-attention calculation of each layer, construct the query matrix $Q$, the key matrix $K$, and the value matrix $V$; S54. For each frame, limit the attention calculation range to a local window of $k$ frames to the left and right of that frame, and calculate the locality-sensitive multi-head attention: $\mathrm{LocalAttention}(q_i) = \mathrm{softmax}\!\left(\dfrac{q_i K_{[i-k,\,i+k]}^{\top}}{\sqrt{d_k}}\right) V_{[i-k,\,i+k]}$; where $q_i$ is the query vector of the $i$-th frame, $K_{[i-k,\,i+k]}^{\top}$ represents the transpose of the matrix of key vectors from frame $i-k$ to frame $i+k$ within the local window, $d_k$ is the dimension of the key vector, $V_{[i-k,\,i+k]}$ is the set of value vectors from frame $i-k$ to frame $i+k$ within the window, the non-window regions are set to zero using a mask, and $\mathrm{softmax}(\cdot)$ indicates a normalization operation; S55. Concatenate the outputs of all attention heads to obtain the multi-head attention output; after linear transformation, residual connection, and layer normalization, the multi-head attention sub-layer output is formed; S56. Input the output of the multi-head attention sub-layer into the feedforward neural network, which includes two fully connected layers and a nonlinear activation function; the output is processed by residual connection and layer normalization to obtain the encoding result of the current layer; S57. Use the encoding result as the input for the next layer, and repeat the operation until the last layer to obtain the encoded feature sequence $H$.

[0034] This invention inputs the fused feature sequence into an improved Transformer encoder with multiple stacked layers. Each layer includes a locality-sensitive multi-head self-attention sub-layer and a feedforward neural network sub-layer. During the input phase, relative position encoding is added to the features of each frame, enabling the encoder to strengthen its temporal context modeling by using the relative distance between frames. In the multi-head self-attention computation, the query, key, and value matrices are constructed by applying linear transformations to the fused feature sequence. The receptive field of the attention computation is limited to a local window of k frames to the left and right of the current frame, and elements outside the window are masked to reduce irrelevant interference, giving the locality-sensitive multi-head attention output. The outputs of all attention heads are concatenated and then passed through linear transformation, residual connection, and layer normalization to obtain the output of the current sub-layer. This output is then fed into the feedforward neural network, which applies two fully connected layers and a nonlinear activation. Finally, residual connection and layer normalization are applied to obtain the encoding result of the current layer. After the layers are stacked, the final encoded feature sequence is output. Because each frame attends to at most 2k+1 frames, the local window constraint reduces the attention cost from the conventional quadratic level in the sequence length to a linear or log-linear level. Combined with relative position encoding, it enhances the ability to capture semantic dependencies across time periods, improving the efficiency and recognition accuracy of long-sequence speech modeling.
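The windowed attention described above can be illustrated with a single-head NumPy sketch. For readability it builds the full T×T score matrix and then masks it, so it shows the masking logic but not the computational saving; an efficient implementation would compute only the band of width 2k+1. The function name `local_attention` and the toy shapes are assumptions.

```python
import numpy as np

def local_attention(q, k_mat, v, k=2):
    """Single-head attention in which frame i attends only to frames i-k..i+k.
    q, k_mat, v: (T, d) arrays. Returns (output, weights)."""
    T, d = q.shape
    scores = q @ k_mat.T / np.sqrt(d)                     # (T, T) scaled scores
    idx = np.arange(T)
    outside = np.abs(idx[:, None] - idx[None, :]) > k     # True outside the window
    scores = np.where(outside, -np.inf, scores)           # mask before the softmax
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                              # masked entries become 0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k_mat, v = (rng.normal(size=(10, 4)) for _ in range(3))
out, weights = local_attention(q, k_mat, v, k=2)
```

Setting `k` to at least T−1 removes the restriction and recovers ordinary full self-attention, which makes the window size a direct trade-off between context and cost.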

[0035] In this embodiment, step S6 specifically includes: S61. Embed the text: $e_i \in \mathbb{R}^{d}$; where $e_i$ represents the embedding vector of the $i$-th generated word, $\mathbb{R}$ represents the real number field, and $d$ is the word embedding dimension; the text embedding, the speaker embedding $s$, and the decoding position encoding $PE_i$ are fused together to form the decoder input vector:

[0036] $\mathrm{DecInput}_i = \mathrm{Fuse}(e_i, s, PE_i)$; where $\mathrm{DecInput}_i$ represents the input vector of the decoder at time step $i$, $e_i$ represents the embedding vector of the $i$-th generated word, $s$ represents the speaker embedding vector, and $PE_i$ represents the decoder position encoding vector, where $i$ is the index of the current decoding step and $d$ is the encoding dimension; S62. Perform contextual interaction between the input vector and the encoded feature sequence to calculate cross-attention: $\mathrm{CrossAttention\_out} = \mathrm{softmax}\!\left(\dfrac{Q H^{\top}}{\sqrt{d_k}}\right) H$; where $\mathrm{CrossAttention\_out}$ represents the result of the cross-attention mechanism, $Q$ represents the query matrix, $H$ represents the encoded feature matrix, $\mathrm{softmax}(\cdot)$ indicates a normalization operation, $H^{\top}$ is the transpose of the encoder output used as the key matrix, $T$ represents the total number of frames in the input sequence, and $d_k$ indicates the dimension of the key vector; S63. Input the cross-attention result into the feedforward neural network to obtain the prediction output vector at the current time step; S64. Calculate the probability distribution of the next target word using a linear transformation and the Softmax function: $P(\hat{y}_{i+1} \mid \hat{y}_1, \dots, \hat{y}_i) = \mathrm{softmax}\big(W_o\,\mathrm{FFN}(\mathrm{CrossAttention\_out}) + b_o\big)$; where $P(\hat{y}_{i+1} \mid \hat{y}_1, \dots, \hat{y}_i)$ is the probability distribution of the next word $\hat{y}_{i+1}$ given the first $i$ words $\hat{y}_1, \dots, \hat{y}_i$, $\mathrm{softmax}(\cdot)$ indicates a normalization operation, $W_o$ is the output weight matrix, $\mathrm{FFN}(\mathrm{CrossAttention\_out})$ is the output of the feedforward neural network, and $b_o$ is the output bias vector; S65. Based on the probability distribution, retain candidates as the results generated in the current step, and use the embedding vector of the result generated in the current step as the input for the next step; repeat recursively until the end-of-sequence marker is generated to obtain the complete target text sequence $\hat{Y}$.

[0037] This invention fuses the text embedding vector and speaker embedding vector generated in the previous time step with the decoding position encoding to form the decoding input vector for the current time step. Then, this input vector is interacted with the encoded feature sequence in context, and cross-attention is calculated to obtain the encoded feature representation most relevant to the current time step. The cross-attention output is then input into a feedforward neural network, undergoing linear transformation and nonlinear activation to obtain the prediction vector for the current time step. The probability distribution of the next target word is then calculated using a Softmax function. Candidate words are selected as the current output based on the probability distribution, and their embedding vectors are used as the input for the next time step. This process is recursively executed until the end-of-sequence symbol is generated, resulting in a complete target text sequence. Through this autoregressive word-by-word generation method, the decoder can fully utilize contextual information and speaker embeddings, improving the semantic coherence and discriminative ability of the output text in multi-speaker scenarios.
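The autoregressive loop described above can be sketched in NumPy as follows. Several simplifications are assumptions and not taken from the source: the fusion of word embedding and speaker embedding is done by addition, the decoding position encoding is omitted, the feedforward network is a single ReLU layer, the next word is chosen greedily, and all parameters are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(dec_input, H, w_ff, w_o, b_o):
    """One decoder step: cross-attention over the encoder output H (T, d),
    a one-layer ReLU feedforward stand-in, then a softmax over the vocabulary."""
    attn = softmax(dec_input @ H.T / np.sqrt(H.shape[1]))  # weights over T frames
    context = attn @ H                                     # cross-attention output
    hidden = np.maximum(context @ w_ff, 0.0)               # feedforward stand-in
    return softmax(hidden @ w_o + b_o)                     # P(next word | history)

def greedy_decode(H, s, embed, w_ff, w_o, b_o, bos=0, eos=1, max_len=20):
    """Feed the previous word's embedding plus the speaker embedding back in
    until the end-of-sequence id appears or max_len steps have run."""
    tokens, prev = [], bos
    for _ in range(max_len):
        dec_input = embed[prev] + s      # additive fusion (assumed); no position term
        probs = decode_step(dec_input, H, w_ff, w_o, b_o)
        prev = int(np.argmax(probs))
        if prev == eos:
            break
        tokens.append(prev)
    return tokens

rng = np.random.default_rng(0)
d, d_ff, vocab, T = 8, 16, 6, 12
H = rng.normal(size=(T, d))
s = rng.normal(size=d)
embed = rng.normal(size=(vocab, d))
w_ff, w_o = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, vocab))
tokens = greedy_decode(H, s, embed, w_ff, w_o, np.zeros(vocab))
```

Greedy selection is the simplest way to close the loop; keeping several candidates per step instead of one turns the same loop into a beam search.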

[0038] In this embodiment, step S7 specifically includes: S71. In the autoregressive decoding process, perform a beam search on the candidate set of the target text sequence, set the beam width B, and retain the top B candidate sequences according to the probability of the candidate sequences; S72. Calculate a score for each candidate sequence, re-score the candidates, and determine the output sequence according to the re-scoring result; S73. Perform punctuation recovery on the output sequence, inserting punctuation marks into the word sequence according to the preset punctuation recovery model; S74. Perform duplicate word deletion on the text after punctuation recovery, removing consecutively repeated words or meaningless filler words, and output the final recognized text result.

[0039] In the autoregressive decoding process, a beam search is performed on the candidate sequences. A beam width B is set, and the top B candidate sequences are retained based on probability. Subsequently, scores are calculated for each candidate sequence, and a re-scoring is performed. The optimal output sequence is selected based on the score, thereby reducing recognition errors caused by low-probability paths. For the selected output sequence, appropriate punctuation marks are further inserted into the word sequence using a punctuation recovery model, making the output text closer to natural language expression. Finally, duplicate word removal is performed to remove consecutively repeated words or meaningless filler words, resulting in a cleaner and more readable recognized text. This post-processing step effectively reduces noise and redundancy in the recognition results, improving the accuracy, fluency, and usability of the output text.
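The beam search, re-scoring, and duplicate removal described above can be sketched in plain Python. The sketch is a toy under assumptions: `step_fn` stands in for the decoder, length-normalized log-probability stands in for the unspecified re-scoring rule, the filler list is hypothetical, and punctuation recovery is left out because it requires a trained model.

```python
import math

def beam_search(step_fn, beam_width=3, eos="</s>", max_len=10):
    """step_fn(prefix) -> {token: probability}. Keeps the top `beam_width`
    prefixes by accumulated log-probability at every step."""
    beams, finished = [((), 0.0)], []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, p in step_fn(prefix).items():
                if p <= 0.0:
                    continue
                cand = (prefix + (tok,), score + math.log(p))
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return finished or beams

def rescore(hyps):
    """Pick the hypothesis with the best length-normalised log-probability."""
    return max(hyps, key=lambda h: h[1] / max(len(h[0]), 1))

def clean(words, fillers=("um", "uh")):
    """Drop filler words and collapse consecutive repeats."""
    out = []
    for w in words:
        if w in fillers or (out and out[-1] == w):
            continue
        out.append(w)
    return out

def toy_step(prefix):
    """Made-up next-word distributions standing in for the decoder."""
    table = {
        (): {"the": 0.6, "um": 0.4},
        ("the",): {"the": 0.7, "cat": 0.3},
        ("the", "the"): {"cat": 1.0},
    }
    return table.get(prefix, {"</s>": 1.0})

best = rescore(beam_search(toy_step, beam_width=3))
text = clean(best[0][:-1])   # strip the end marker, then remove repeats/fillers
```

In this toy run the best hypothesis contains a stuttered "the the", which the clean-up step collapses into a single word.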

[0040] In this embodiment, step S8 specifically includes: S81. Calculate the loss function of the VAD model as the binary cross-entropy: $L_{\mathrm{VAD}} = -\frac{1}{T}\sum_{t=1}^{T}\left[y_t \log p_t + (1 - y_t)\log(1 - p_t)\right]$; where $L_{\mathrm{VAD}}$ represents the loss function of the VAD model, $\frac{1}{T}$ indicates normalization over the total number of time steps $T$, $y_t$ represents the frame label corresponding to time $t$, $p_t$ is the speech probability of the frame corresponding to time $t$, $y_t = 1$ represents a speech frame, and $y_t = 0$ indicates a silent frame; S82. Combine the endpoint detection loss $L_{\mathrm{VAD}}$ with the speech recognition loss $L_{\mathrm{ASR}}$ and the speaker discrimination loss $L_{\mathrm{SPK}}$ by weighted summation with the weighting factors $\lambda_1, \lambda_2, \lambda_3$ to obtain the total loss for joint training: $L_{\mathrm{total}} = \lambda_1 L_{\mathrm{ASR}} + \lambda_2 L_{\mathrm{VAD}} + \lambda_3 L_{\mathrm{SPK}}$; where $L_{\mathrm{total}}$ indicates the total loss of joint training, and the weights $\lambda_1, \lambda_2, \lambda_3$ are non-negative real numbers corresponding respectively to the speech recognition loss $L_{\mathrm{ASR}}$, the endpoint detection loss $L_{\mathrm{VAD}}$, and the speaker discrimination loss $L_{\mathrm{SPK}}$; S83. Based on the total loss $L_{\mathrm{total}}$, update the parameters of the VAD endpoint detection model, the CAM++ speaker differentiation module, and the improved Transformer model through backpropagation; S84. Determine whether the training has converged according to the preset convergence criterion. If converged, output and save the trained model; otherwise, continue iterative execution.

[0041] This invention calculates the binary cross-entropy loss of the VAD endpoint detection model to measure the consistency between the predicted frame and the ground truth frame label. Then, this endpoint detection loss is weighted and summed with the speech recognition cross-entropy loss and the speaker discrimination loss according to set weight coefficients to form a joint training total loss function. Using this total loss as the optimization objective, the parameters of the VAD endpoint detection model, the CAM++ speaker discrimination module, and the improved Transformer recognition model are simultaneously updated using the backpropagation algorithm, ensuring that endpoint detection, speaker embedding extraction, and text recognition converge collaboratively during the same training process. During training, a preset convergence criterion is used to determine whether the convergence condition has been met. After convergence, the trained model parameters are output and saved. This joint training method enables information sharing and gradient collaboration among multiple tasks, making VAD detection more aligned with recognition needs, speaker embedding more discriminative, and the recognition model converge faster, ultimately significantly improving the overall recognition accuracy and robustness in complex acoustic environments and multi-speaker scenarios.
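The loss combination described above can be written out in a few lines. This is a sketch under assumptions: the weight values (1.0, 0.5, 0.5) are placeholders, the speech recognition and speaker discrimination losses are passed in as plain numbers, and the clipping constant `eps` is added only for numerical safety.

```python
import numpy as np

def vad_bce(y, p, eps=1e-12):
    """Binary cross-entropy averaged over T frames; y in {0, 1}, p in (0, 1)."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def total_loss(l_asr, l_vad, l_spk, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three task losses; weights must be non-negative."""
    if any(w < 0 for w in weights):
        raise ValueError("loss weights must be non-negative")
    w_asr, w_vad, w_spk = weights
    return w_asr * l_asr + w_vad * l_vad + w_spk * l_spk

l_vad = vad_bce([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.6])   # made-up labels and outputs
loss = total_loss(l_asr=2.0, l_vad=l_vad, l_spk=0.7)  # made-up task losses
```

Since a single scalar is optimized, one backward pass through it updates all three modules at once; the weights control how strongly each task shapes the shared gradients.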

[0042] Example 1: To verify the feasibility and effectiveness of this invention in a real-world scenario, it was applied to a multi-speaker conference speech recognition system. In this scenario, the conference had 5-8 participants, lasted one hour, and was conducted in a typical conference room with background noise (approximately 45dB from air conditioning and 60dB from keyboard typing). Multiple speakers frequently alternated speaking and interrupted. Traditional speech recognition systems often rely on endpoint detection algorithms with fixed thresholds, resulting in a high false detection rate in noisy environments. This leads to inaccurate segmentation of effective speech segments, affecting downstream recognition accuracy. Furthermore, recognition models that do not consider speaker embedding information often suffer from role confusion in multi-speaker scenarios, resulting in transcribed text lacking speaker labels, which is detrimental to subsequent conference record analysis. In addition, traditional models typically train endpoint detection, acoustic models, and language models separately, lacking joint optimization, limiting overall recognition performance, especially in long speech and multi-speaker scenarios.

[0043] In this embodiment, the acquired multi-speaker speech signals are first preprocessed. Speech features are extracted through pre-emphasis filtering, framing, windowing, and Mel filter banks to form a feature sequence X containing 26-dimensional static and dynamic features. This feature sequence is then input into a VAD endpoint detection model based on a bidirectional recurrent structure. A sliding window is used to predict the speech probability of each frame and determine the start and end frames of the speech, thus obtaining the feature sequence of the effective speech segments. This avoids the false triggering and incomplete speech segmentation caused by background noise. On this basis, a speaker differentiation module based on the CAM++ architecture is used to extract frame-level acoustic features. Frame weights are calculated using an attention mechanism and weighted aggregation is performed to obtain the speaker embedding vector s. This s is then fused with the effective speech segment feature sequence to obtain a fused speech feature sequence containing speaker information, which enhances the ability to distinguish speakers in multi-speaker scenarios.
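The framing, windowing, and static-plus-dynamic feature layout mentioned above can be sketched as follows. The concrete choices are assumptions for illustration only: a 16 kHz signal with 400-sample frames and a 160-sample shift, a Hamming window, and a simple frame-to-frame difference as the first-order dynamic feature. The Fourier transform, Mel filter bank, and discrete cosine transform stages are omitted.

```python
import numpy as np

def frame_and_window(x, frame_len=400, frame_shift=160):
    """Split signal x into overlapping frames x(i*M + t), t = 0..N-1, and
    multiply each frame by a window w(t) (Hamming assumed)."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def add_deltas(static):
    """Append first-order differences to (T, 13) static coefficients -> (T, 26)."""
    delta = np.diff(static, axis=0, prepend=static[:1])   # first row is all zeros
    return np.concatenate([static, delta], axis=1)

signal = np.random.default_rng(0).normal(size=16000)      # one second of noise
frames = frame_and_window(signal)
features = add_deltas(np.ones((5, 13)))                   # placeholder coefficients
```

Stacking 13 static coefficients with their 13 first-order differences gives the 26-dimensional per-frame vector referred to in this embodiment.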

[0044] The fused feature sequence is input into the encoder of the improved Transformer model. This encoder employs a locality-sensitive multi-head self-attention mechanism, which reduces the computational complexity of long sequence modeling by limiting the attention window to focus on features in adjacent frames and enhances the model's ability to capture frame order relationships through relative position encoding. After multiple layers of encoding, a context-enhanced encoded feature sequence is obtained and fed into the decoder to progressively generate the target text sequence in an autoregressive manner. During the generation process, a beam search strategy is used to re-score candidate sequences, retaining the highest-scoring candidates, which significantly reduces the perplexity of the language model and improves the fluency of text generation.

[0045] To verify the beneficial effects of this invention, a comparative experiment with a traditional speech recognition system was designed. The experiment used a multi-speaker conference dataset, with the training set containing 2000 hours of speech and the test set containing 50 hours of multi-speaker speech, and the noise environment ranging from 35dB to 60dB. The performance of the traditional recognition model and the jointly trained end-to-end recognition model proposed in this invention was compared in terms of word error rate (WER), endpoint detection accuracy, speaker discrimination accuracy, and latency. The experimental results are shown in Table 1. Table 1. Performance Comparison of Different Methods in Multi-Speaker Conference Recognition Scenarios

[0046] As shown in Table 1, the end-to-end speech recognition method based on the improved Transformer structure proposed in this invention significantly reduces the word error rate under various noise conditions, averaging about 45% lower than traditional methods. Endpoint detection accuracy is also improved by about 5%-8%, ensuring the integrity and accuracy of speech segmentation. Simultaneously, speaker discrimination accuracy is improved by an average of about 4%, allowing transcribed text in multi-speaker conference scenarios to directly include speaker tags for easier subsequent analysis. Recognition latency increases only slightly, still meeting real-time requirements. These experimental results fully demonstrate that this invention, by introducing speaker embedding fusion and multi-task joint training, improves the synergy among endpoint detection, speaker discrimination, and speech recognition. This results in a system with stronger robustness and higher recognition accuracy in complex acoustic environments and multi-speaker scenarios, significantly improving the problems of inaccurate segmentation, speaker confusion, and insufficient recognition performance in existing technologies. It has significant application value and promising prospects for widespread adoption.
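For reference, the word error rate used in this comparison is conventionally computed as the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below shows that standard definition together with a helper for relative reduction; the example sentences are made up and are not experimental data from this document.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

wer = word_error_rate("the cat sat down", "the cat sat")   # one deleted word
```

A relative reduction compares two error rates as a ratio, so it should not be confused with a difference in percentage points.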

[0047] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. An end-to-end speech recognition method based on an improved Transformer architecture, characterized in that it includes the following steps: S1. Preprocess the input speech signal, extract speech feature parameters, and form a speech feature sequence; S2. Input the speech feature sequence into the VAD endpoint detection model to determine the start and end frames of the speech, and obtain the feature sequence of the effective speech segment; S3. Input the feature sequence of the effective speech segment into the speaker discrimination module based on the CAM++ model, extract speaker features using a convolutional neural network, aggregate them through a weighted attention mechanism, and output the speaker embedding vector; S4. Fuse the speaker embedding vector with the feature sequence of the effective speech segment to obtain a fused speech feature sequence containing speaker information; S5. Input the fused speech feature sequence into the improved Transformer model encoder, perform relative position encoding, locality-sensitive multi-head self-attention, feedforward neural network processing, and layer normalization, and output the encoded feature sequence; S6. Input the text embedding from the previous time step and the speaker embedding vector into the improved Transformer model decoder, interact with the encoded feature sequence, and generate the target text sequence in an autoregressive manner; S7. Perform beam search rescoring, punctuation recovery, and duplicate word deletion on the target text sequence, and output the recognized text result; S8. Jointly train the VAD endpoint detection model, the CAM++ model, and the improved Transformer model, and jointly optimize the model parameters based on the weighted total loss of the VAD loss, the speaker discrimination loss, and the speech recognition loss.

2. The end-to-end speech recognition method based on an improved Transformer structure according to claim 1, characterized in that step S1 specifically includes: S11. Divide the input audio signal into frames with a frame length of $N$ and a frame shift of $M$; S12. Apply windowing to each frame of the speech signal $x_i(t)$, with the window function defined as follows: $w(t) =$ [missing value]; where $w(t)$ is the weight of the window function at the $t$-th point, $t$ represents the index of a discrete sampling point within the frame, $0 \le t \le N-1$, and $N$ is the frame length; for the $i$-th frame of the speech signal $x_i(t)$, the windowed frame signal is represented as follows: $\tilde{x}_i(t) = x(iM + t)\,w(t)$; where $\tilde{x}_i(t)$ represents the windowed speech signal of the $i$-th frame, $x(\cdot)$ is the original speech signal, $i$ is the frame index, $M$ is the frame shift, and $w(t)$ is the weight of the window function at the $t$-th point; S13. Perform a Fourier transform on the windowed frame signal to obtain the frame-level spectral amplitude; S14. Input the frame-level spectral amplitude into the Mel filter bank, wherein the number of Mel filters is 40 and the frequency range is 20 Hz to 4 kHz, and calculate the energy of the Mel filter bank; S15. Take the logarithm of the energy of the Mel filter bank and perform a discrete cosine transform to extract the 13-dimensional Mel frequency cepstral coefficients and their first-order differences, forming a feature vector containing static and dynamic features, and arrange the feature vectors in chronological order to form the speech feature sequence X.

3. The end-to-end speech recognition method based on an improved Transformer structure according to claim 1, characterized in that step S2 specifically includes: S21. Divide the speech feature sequence into input subsequences using a 5-frame sliding window; S22. Input the input subsequences into the VAD endpoint detection model based on bidirectional LSTM, and output the speech presence probability $p_t$ of each frame; S23. Determine the start and end frames of the speech based on the speech probability of each frame, and obtain the effective speech segment feature sequence $X'$; S24. Define the label sequence for the valid speech segment: $Y_{\mathrm{VAD}} = \{y_1, y_2, \dots, y_T\},\ y_t \in \{0, 1\}$; where $Y_{\mathrm{VAD}}$ is the label sequence representing the effective speech segment, $y_t$ represents the label of frame $t$, $y_t = 1$ indicates that frame $t$ is a speech frame, $y_t = 0$ indicates that frame $t$ is a silent frame, and $T$ is the total number of frames.

4. The end-to-end speech recognition method based on an improved Transformer structure according to claim 1, characterized in that step S3 specifically includes: S31. Input the effective speech segment feature sequence into the speaker differentiation module based on the CAM++ structure; S32. Perform convolution operations on the frame features using a multi-layer convolutional neural network, including 3 convolutional layers with a kernel size of [missing value], a stride of 2, and a number of output channels of [missing value], to obtain the frame-level acoustic feature maps: $h_t = \mathrm{CNN}(x_t)$; where $h_t$ represents the frame-level acoustic feature vector obtained in frame $t$, $\mathrm{CNN}(\cdot)$ represents the convolution operations, and $x_t$ represents the audio signal of frame $t$ after windowing processing; S33. Calculate the global average pooling feature: $g = \frac{1}{T}\sum_{t=1}^{T} h_t$; where $g$ represents the global average pooled feature vector obtained by aggregating all frame-level features, $T$ indicates the total number of frames in the valid speech segments, and $h_t$ represents the frame-level acoustic feature vector obtained in frame $t$; attention weights are generated using two fully connected layers: $\alpha_t = \dfrac{\exp\big(\mathrm{FC}_1(g)^{\top}\,\mathrm{FC}_2(h_t)\big)}{\sum_{j=1}^{T}\exp\big(\mathrm{FC}_1(g)^{\top}\,\mathrm{FC}_2(h_j)\big)}$; where $\alpha_t$ represents the attention coefficient of frame $t$, $\exp(\cdot)$ represents the natural exponential function, $\mathrm{FC}_1(g)$ indicates that $g$ is mapped by a fully connected layer, $g$ represents the global average pooling feature vector, $\mathrm{FC}_2(h_t)$ indicates that the frame features are mapped by a fully connected layer, $h_t$ represents the frame-level acoustic feature vector obtained in frame $t$, $T$ indicates the total number of frames in the valid speech segments, and $h_j$ indicates the frame-level acoustic features of the $j$-th valid frame; S34. Weighted aggregation of the frame-level features yields the speaker embedding vector: $s = \sum_{t=1}^{T}\alpha_t h_t$; where $s$ represents the speaker embedding vector, $T$ indicates the total number of frames in the valid speech segments, $\alpha_t$ represents the attention coefficient of frame $t$, and $h_t$ represents the frame-level acoustic feature vector obtained in frame $t$.

5. The end-to-end speech recognition method based on an improved Transformer structure according to claim 1, characterized in that step S4 specifically includes: S41. Fuse the speaker embedding vector with the feature sequence of the effective speech segment to obtain the fused frame features: $z_t = \mathrm{Concat}(x'_t, s)$; where $z_t$ indicates the fused frame feature, $\mathrm{Concat}(\cdot)$ represents the concatenation of feature vectors, $x'_t$ represents the frame-$t$ feature of the effective speech segment feature sequence, and $s$ represents the speaker embedding vector; S42. Perform linear transformation and normalization on the fused frame features to obtain the final $d$-dimensional fused speech feature sequence containing speaker information $Z = \{\tilde{z}_1, \tilde{z}_2, \dots, \tilde{z}_T\}$; where $d = d_x + d_s$, $d_s$ is the embedding vector dimension, $d_x$ is the original feature dimension, and $d$ represents the total dimension after merging.

6. The end-to-end speech recognition method based on an improved Transformer structure according to claim 1, characterized in that step S5 specifically includes: S51. Input the fused feature sequence into a multi-layer stacked improved Transformer encoder, using a three-layer hierarchical Transformer encoder, each layer including a locality-sensitive multi-head self-attention sub-layer and a feedforward neural network sub-layer; S52. Add relative position encoding to the features of each frame. The encoded input is represented as follows: $\mathrm{EncInput}_t = \tilde{z}_t + PE_t$; $PE_t \in \mathbb{R}^{d}$; where $\mathrm{EncInput}_t$ is the encoder input vector for the $t$-th frame, $\tilde{z}_t$ is the fused feature vector of frame $t$, $d$ is the encoding dimension, and $PE_t$ is the position vector of the $t$-th frame; S53. In the multi-head self-attention calculation of each layer, construct the query matrix $Q$, the key matrix $K$, and the value matrix $V$; S54. For each frame, limit the attention calculation range to a local window of $k$ frames to the left and right of that frame, and calculate the locality-sensitive multi-head attention: $\mathrm{LocalAttention}(q_i) = \mathrm{softmax}\!\left(\dfrac{q_i K_{[i-k,\,i+k]}^{\top}}{\sqrt{d_k}}\right) V_{[i-k,\,i+k]}$; where $q_i$ is the query vector of the $i$-th frame, $K_{[i-k,\,i+k]}^{\top}$ represents the transpose of the matrix of key vectors from frame $i-k$ to frame $i+k$ within the local window, $d_k$ is the dimension of the key vector, $V_{[i-k,\,i+k]}$ is the set of value vectors from frame $i-k$ to frame $i+k$ within the window, the non-window regions are set to zero using a mask, and $\mathrm{softmax}(\cdot)$ indicates a normalization operation; S55. Concatenate the outputs of all attention heads to obtain the multi-head attention output; after linear transformation, residual connection, and layer normalization, the multi-head attention sub-layer output is formed; S56. Input the output of the multi-head attention sub-layer into the feedforward neural network, which includes two fully connected layers and a nonlinear activation function; the output is processed by residual connection and layer normalization to obtain the encoding result of the current layer; S57. Use the encoding result as the input for the next layer, and repeat the operation until the last layer to obtain the encoded feature sequence $H$.

7. The end-to-end speech recognition method based on an improved Transformer structure according to claim 1, characterized in that step S6 specifically includes: S61. Embed the text: $e_i \in \mathbb{R}^{d}$; where $e_i$ represents the embedding vector of the $i$-th generated word, $\mathbb{R}$ represents the real number field, and $d$ is the word embedding dimension; the text embedding, the speaker embedding $s$, and the decoding position encoding $PE_i$ are fused together to form the decoder input vector: $\mathrm{DecInput}_i = \mathrm{Fuse}(e_i, s, PE_i)$; where $\mathrm{DecInput}_i$ represents the input vector of the decoder at time step $i$, $e_i$ represents the embedding vector of the $i$-th generated word, $s$ represents the speaker embedding vector, and $PE_i$ represents the decoder position encoding vector, where $i$ is the index of the current decoding step and $d$ is the encoding dimension; S62. Perform contextual interaction between the input vector and the encoded feature sequence to calculate cross-attention: $\mathrm{CrossAttention\_out} = \mathrm{softmax}\!\left(\dfrac{Q H^{\top}}{\sqrt{d_k}}\right) H$; where $\mathrm{CrossAttention\_out}$ represents the result of the cross-attention mechanism, $Q$ represents the query matrix, $H$ represents the encoded feature matrix, $\mathrm{softmax}(\cdot)$ indicates a normalization operation, $H^{\top}$ is the transpose of the encoder output used as the key matrix, $T$ represents the total number of frames in the input sequence, and $d_k$ indicates the dimension of the key vector; S63. Input the cross-attention result into the feedforward neural network to obtain the prediction output vector at the current time step; S64. Calculate the probability distribution of the next target word using a linear transformation and the Softmax function: $P(\hat{y}_{i+1} \mid \hat{y}_1, \dots, \hat{y}_i) = \mathrm{softmax}\big(W_o\,\mathrm{FFN}(\mathrm{CrossAttention\_out}) + b_o\big)$; where $P(\hat{y}_{i+1} \mid \hat{y}_1, \dots, \hat{y}_i)$ is the probability distribution of the next word $\hat{y}_{i+1}$ given the first $i$ words $\hat{y}_1, \dots, \hat{y}_i$, $\mathrm{softmax}(\cdot)$ indicates a normalization operation, $W_o$ is the output weight matrix, $\mathrm{FFN}(\mathrm{CrossAttention\_out})$ is the output of the feedforward neural network, and $b_o$ is the output bias vector; S65. Based on the probability distribution, retain candidates as the results generated in the current step, and use the embedding vector of the result generated in the current step as the input for the next step; repeat recursively until the end-of-sequence marker is generated to obtain the complete target text sequence $\hat{Y}$.

8. The end-to-end speech recognition method based on an improved Transformer structure according to claim 1, characterized in that step S7 specifically includes: S71. In the autoregressive decoding process, perform a beam search on the candidate set of the target text sequence, set the beam width B, and retain the top B candidate sequences according to the probability of the candidate sequences; S72. Calculate a score for each candidate sequence, re-score the candidates, and determine the output sequence according to the re-scoring result; S73. Perform punctuation recovery on the output sequence, inserting punctuation marks into the word sequence according to the preset punctuation recovery model; S74. Perform duplicate word deletion on the text after punctuation recovery, removing consecutively repeated words or meaningless filler words, and output the final recognized text result.

9. The end-to-end speech recognition method based on an improved Transformer structure according to claim 1, characterized in that step S8 specifically includes: S81. Calculate the loss function of the VAD model as the binary cross-entropy: $L_{\mathrm{VAD}} = -\frac{1}{T}\sum_{t=1}^{T}\left[y_t \log p_t + (1 - y_t)\log(1 - p_t)\right]$; where $L_{\mathrm{VAD}}$ represents the loss function of the VAD model, $\frac{1}{T}$ indicates normalization over the total number of time steps $T$, $y_t$ represents the frame label corresponding to time $t$, $p_t$ is the speech probability of the frame corresponding to time $t$, $y_t = 1$ represents a speech frame, and $y_t = 0$ indicates a silent frame; S82. Combine the endpoint detection loss $L_{\mathrm{VAD}}$ with the speech recognition loss $L_{\mathrm{ASR}}$ and the speaker discrimination loss $L_{\mathrm{SPK}}$ by weighted summation with the weighting factors $\lambda_1, \lambda_2, \lambda_3$ to obtain the total loss for joint training: $L_{\mathrm{total}} = \lambda_1 L_{\mathrm{ASR}} + \lambda_2 L_{\mathrm{VAD}} + \lambda_3 L_{\mathrm{SPK}}$; where $L_{\mathrm{total}}$ indicates the total loss of joint training, and the weights $\lambda_1, \lambda_2, \lambda_3$ are non-negative real numbers corresponding respectively to the speech recognition loss $L_{\mathrm{ASR}}$, the endpoint detection loss $L_{\mathrm{VAD}}$, and the speaker discrimination loss $L_{\mathrm{SPK}}$; S83. Based on the total loss $L_{\mathrm{total}}$, update the parameters of the VAD endpoint detection model, the CAM++ speaker differentiation module, and the improved Transformer model through backpropagation; S84. Determine whether the training has converged according to the preset convergence criterion. If converged, output and save the trained model; otherwise, continue iterative execution.