A pluggable target speaker speech recognition method and system
By adopting a pluggable two-stage training architecture with direct connection in the feature domain, the problems of low recognition rate and poor adaptability in multi-speaker speech recognition are solved, achieving efficient and flexible target speaker speech recognition, reducing computing power overhead, and retaining the general recognition capability of the ASR model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIAMEN LIMAYAO NETWORK TECH CO LTD
- Filing Date
- 2026-04-02
- Publication Date
- 2026-06-23
AI Technical Summary
Existing multi-speaker speech recognition technologies have low recognition rates when dealing with complex and overlapping speech, and are difficult to adapt to different ASR models, resulting in decreased recognition performance and wasted resources.
A pluggable two-stage training architecture with direct connection in the feature domain is adopted. An adaptive voiceprint network with multi-layer convolution + RMSnorm + MeanPooling is used to extract fixed-dimensional global timbre embedding vectors. Combined with L1 loss, physical alignment training is completed to generate feature preprocessing and voiceprint embedding information adapted to ASR. The ASR parameters are frozen and only the front-end module is fine-tuned to achieve semantic alignment between features and ASR model.
It significantly reduces the word error rate under overlapping speech, improves the industrial compatibility and deployment flexibility of the system, retains the general recognition capability of the pre-trained ASR model, and reduces computing power overhead.
Figure CN121963713B_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a pluggable target speaker speech recognition method and system.
Background Technology
[0002] Currently, there are two main technical approaches in the field of multi-speaker speech recognition: cascaded solutions and end-to-end solutions. Their specific implementations and applications are as follows: The core logic of the cascaded solution is to first extract the target audio waveform from the mixed audio using an independent speech separation model, and then feed the extracted waveform data into a downstream general ASR (Automatic Speech Recognition) model for semantic recognition. In the cascaded solution, the training objective of the speech separation module focuses only on the reconstruction of the target audio waveform, without considering the semantic recognition requirements of the backend ASR model. This difference in task objectives makes it difficult for the system to generate features suitable for ASR semantic recognition when processing complex overlapping speech, ultimately resulting in a low recognition rate.
[0003] End-to-end solutions require full fine-tuning of the model to adapt to multi-speaker scenarios. This process inevitably undermines the general recognition capabilities learned by the pre-trained model on large-scale data, leading to a decline in the model's recognition performance in single-speaker scenarios and the "catastrophic forgetting" phenomenon. The architecture of end-to-end solutions is strongly tied to specific ASR models, requiring redesign of the network structure for different ASR models, making flexible cross-architecture adaptation impossible. In cascaded solutions, the interface between the speech separation module and the ASR model lacks standardized design, resulting in high adaptation costs and making it difficult to meet the compatibility requirements of different mature ASR systems in industrial scenarios.
[0004] In addition, existing models generally need to process the speech information of all speakers in the mixed audio at the same time. However, in scenarios such as interactive digital humans and interviews with specific people, users only focus on the speech content of the target speaker. The design of full recognition leads to unnecessary consumption of computing resources and cannot achieve lightweight deployment.
Summary of the Invention
[0005] To solve the above-mentioned technical problems, the present invention provides the following technical solution:
[0006] A pluggable target speaker speech recognition method includes: acquiring the target speech recognition requirements in multi-speaker scenarios and a pluggable two-stage training architecture with direct feature-domain connection adapted to these requirements; performing feature preprocessing on mixed audio, target reference audio, and clean target audio, uniformly converting them into the Mel spectral feature format specified by the downstream ASR, and extracting a fixed-dimensional global timbre embedding vector through an adaptive voiceprint network of multi-layer convolution + RMSNorm + MeanPooling, to generate feature preprocessing and voiceprint embedding information adapted to the ASR; constructing a multi-layer adaptive convolutional encoding extraction network, copying the timbre vector along the time axis, modulating the mixed audio features layer by layer using AdaLayer to suppress non-target components, restoring them to the original Mel spectral dimension through an output projection layer, and completing physical alignment training with an L1 loss, to generate feature modulation, fusion, and reconstruction information with basic target extraction capability; directly cascading the extraction module with the pre-trained ASR in the feature domain and freezing all ASR parameters so that gradients flow only through the front end; obtaining target features from the input mixed and reference audio; selecting the loss according to the ASR architecture, CTCLoss for CTC models and cross-entropy loss for Encoder-Decoder models, to generate semantic loss feedback and parameter optimization information that updates only the front end; aligning the output dimension of the extraction module with the input dimension of the ASR, to generate feature dimension matching and lightweight fine-tuning pluggable adaptation information that adapts seamlessly to mainstream ASRs; and, relying on the task-consistency collaborative fine-tuning mechanism of the frozen ASR, using the ASR as a static semantic discriminator to generate target speech semantic recognition and accurate extraction information with low computational overhead and a high recognition rate.
[0007] A pluggable target speaker speech recognition system, wherein the system is used to execute executable instructions to perform the aforementioned pluggable target speaker speech recognition method.
[0008] The beneficial effects are as follows: through second-stage joint training with the ASR parameters frozen, this invention uses the recognition loss function to guide the front-end extraction module, ensuring that the extracted features are semantically aligned with the ASR model and significantly reducing the word error rate under overlapping speech. Because the back-end ASR parameters are frozen, only the lightweight front-end extraction module needs fine-tuning. Since the weights of the ASR model itself are not altered, the original general recognition capability of the pre-trained ASR model is preserved.
[0009] Furthermore, this invention employs a pluggable architecture with direct connection of feature domains. Since the output dimension of the extraction module is pre-defined to be consistent with the input dimension of the ASR and does not depend on a specific ASR internal structure, this extraction module can be used as an independent plug-in, seamlessly adapting to mature ASR systems with various mainstream architectures, greatly improving the system's industrial compatibility and deployment flexibility.
Attached Figure Description
[0010] Figure 1 is a flowchart illustrating a pluggable target speaker speech recognition method provided in an embodiment of the present invention;
[0011] Figure 2 is a schematic diagram of a pluggable target speaker speech recognition system provided in an embodiment of the present invention.
Detailed Implementation
[0012] The preferred embodiments of the present invention will be described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are for illustration and explanation only and are not intended to limit the present invention. Figure 1 shows a pluggable target speaker speech recognition method according to an exemplary embodiment of this application.
[0013] In this application embodiment, a pluggable target speaker speech recognition method is provided, as shown in Figure 1:
[0014] S101: Obtain the target speech recognition requirements in multi-speaker scenarios and a pluggable two-stage training architecture with direct feature-domain connection adapted to these requirements.
[0015] In one implementation, core target requirements are extracted from real-world application scenarios of multi-speaker speech recognition (such as interactive digital humans, interviews with specific individuals, and conference recordings). These core requirements revolve around four dimensions: recognition performance, model adaptability, computational cost, and retention of general capabilities. Specifically, these requirements are: recognizing only the speech information of the target speaker in the mixed audio, avoiding the waste of computational resources caused by recognizing all speakers; solving the problem of low recognition rate of overlapping speech caused by the inconsistency between speech separation and ASR semantic recognition task objectives in traditional cascaded solutions; avoiding the degradation of pre-trained ASR general recognition capabilities caused by full-scale model fine-tuning in traditional end-to-end solutions, preventing "catastrophic forgetting"; and improving the adaptability and flexibility of the model architecture, eliminating the need to redesign the architecture for specific ASR models and enabling seamless integration with various mainstream pre-trained ASR systems.
[0016] Based on the above scenario requirements, the corresponding technical indicator requirements are clearly defined, including reducing the word error rate in overlapping speech scenarios and achieving accurate recognition of the target speaker's speech; the model front-end extraction module can be used as an independent plugin, seamlessly adapting to ASR models with different architectures such as CTC and Encoder-Decoder; only the lightweight front-end module needs to be fine-tuned, without the need for a full update of model parameters, significantly reducing the computational cost of training and inference; and the original general recognition and language modeling capabilities of the pre-trained ASR model are retained to adapt to the recognition needs of different single / multiple speaker scenarios.
[0017] In interactive digital human scenarios, users only need the digital human to recognize the voice commands of a specific user, without needing to recognize the voices of other people in the environment. Therefore, the requirements are "accurate extraction of the voice of a specific target speaker + real-time recognition with low computing power + no need to reconstruct the existing ASR model of the digital human". Based on this, the technical indicators extracted are "reduced word error rate under overlapping speech, the extraction module can be plugged and adapted to the existing ASR of the digital human, and only the front-end module needs to be fine-tuned".
[0018] To address the aforementioned extraction requirements, two core design principles are established: direct feature domain connection and pluggable design. Simultaneously, the auxiliary design principles of freezing the backend ASR and fine-tuning only the frontend are clarified, providing core guidelines for architecture construction. The specific requirements of each principle are as follows: The direct feature domain connection principle breaks the traditional cascaded scheme's "feature-waveform-feature" reconstruction and transformation limitations. All module processing in the architecture is completed within the feature domain required by the downstream ASR (such as the Mel spectrum feature domain). The output of the frontend target feature extraction module directly connects to the input of the pre-trained ASR model in the feature domain, eliminating the need for waveform reconstruction and secondary feature extraction, thus ensuring the consistency and efficiency of feature transfer.
[0019] Under the pluggable design principle, the front-end target feature extraction module is an independent modular design. Its output feature dimension is pre-defined to match the input dimension of mainstream pre-trained ASR models. Furthermore, the module design does not depend on the internal structure, hierarchy, or parameters of any specific ASR model, so it can be plugged into or removed from an ASR system as an independent plugin, achieving a "plug-and-play" adaptation effect. Under the auxiliary principle, all weight parameters of the pre-trained ASR model are frozen throughout the architecture; only the lightweight front-end target feature extraction module undergoes parameter fine-tuning and updates. Gradients flow only within the front-end module, ensuring that the general capabilities of the pre-trained ASR are not compromised, while also reducing training compute and complexity.
[0020] To address the need for compatibility with both Whisper (Encoder-Decoder architecture) and Wav2Vec2.0 (CTC architecture) ASR models in conference voice recording scenarios, the front-end extraction module is designed with the pluggable principle in mind. Its output dimension is aligned with the input dimension of both ASR models. Furthermore, the module does not contain any proprietary structures of Whisper or Wav2Vec2.0, allowing it to be directly integrated into both ASR systems without additional modifications.
[0021] Based on the above requirements analysis and design principles, a pluggable two-stage training architecture with direct feature domain connection is constructed. This architecture is a cascaded structure of an independent front-end module and a frozen back-end ASR module. The core consists of two main parts: a front-end target feature extraction module and a back-end pre-trained ASR module. The training process is divided into two stages: the first stage, pre-training of the target feature extraction module, and the second stage, task consistency collaborative fine-tuning based on frozen ASR. The module composition, hierarchical relationship, connection method, and training stage plan of the architecture are as follows:
[0022] The front-end target feature extraction module is a lightweight, independently deployable module. It consists of four levels: a feature preprocessing submodule, a global voiceprint feature extraction submodule, a multi-layer adaptive convolutional coding submodule, and a feature restoration and alignment submodule. The levels are connected sequentially and perform, in order, audio feature conversion, target timbre extraction, mixed audio feature modulation, and feature restoration with loss training. The overall output of the module is the target speaker's speech features in the feature domain.
[0023] For the backend pre-trained ASR module, various mainstream and mature pre-trained ASR models are used (supporting architectures such as CTC and Encoder-Decoder). As the semantic discrimination module of the architecture, its parameters are kept frozen throughout the process, and it does not participate in any parameter updates. It only receives the output features of the frontend module and completes semantic recognition, outputting the text prediction result. The output of the frontend target feature extraction module is directly connected to the input of the backend pre-trained ASR module in the feature domain, without any intermediate transformation modules. The output feature dimension is strictly consistent with the ASR input dimension, forming a seamless cascade.
[0024] The training process of the architecture is divided into two stages, executed sequentially. The training results of the first stage serve as the input basis for the second stage. The training objectives, training objects, and core requirements of the two stages are clearly distinguished, and the specific plan is as follows: First stage: Pre-training of the target feature extraction module: The training object is all parameters of the front-end target feature extraction module, and the back-end ASR module does not participate in the training for the time being; the training objective is to enable the front-end module to acquire basic target speaker feature extraction capabilities, realize the separation of target features and non-target features in mixed audio, and complete the physical alignment training of features through L1 loss.
[0025] The second stage involves collaborative fine-tuning based on frozen ASR for task consistency: the training object is only the parameters of the front-end target feature extraction module, while the parameters of the back-end ASR module are frozen throughout the entire process; the training objective is to solve the "task objective inconsistency" problem, ensuring that the features extracted by the front-end module are semantically aligned with the ASR model. The semantic loss function corresponding to ASR is used to guide the parameter updates of the front-end module, achieving semantic alignment training of features. All processing in both stages is completed in the feature domain, and the output dimension of the front-end module remains consistent with the input dimension of the back-end ASR throughout the entire training process, ensuring that the pluggable nature of the architecture does not change during training.
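The two-stage schedule described above can be illustrated with a deliberately tiny numerical sketch. Everything here is a toy assumption rather than the patented implementation: the front end and the "ASR" are single linear maps, the interferer is random noise, and a squared error measured after the frozen back end stands in for CTCLoss or cross-entropy. The only points it is meant to show are that stage 1 minimizes an L1 distance to the clean Mel features, that stage 2 back-propagates through the frozen back end, and that the back-end weights are never updated.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MEL, N_OUT, T = 80, 16, 200   # 80 comes from the text; N_OUT and T are toy values

clean = rng.normal(size=(T, N_MEL))              # clean target Mel frames
mixed = clean + rng.normal(size=(T, N_MEL))      # target plus a toy interferer

W_front = rng.normal(scale=0.1, size=(N_MEL, N_MEL))  # trainable front end (toy)
W_asr = rng.normal(scale=0.1, size=(N_MEL, N_OUT))    # frozen "ASR" (toy)
W_asr_before = W_asr.copy()
asr_target = clean @ W_asr   # stand-in for the supervision a real ASR loss would use


def l1_loss(W):
    """Stage-1 objective: mean absolute error against the clean Mel features."""
    return float(np.mean(np.abs(mixed @ W - clean)))


def semantic_loss(W):
    """Stage-2 stand-in objective, measured AFTER the frozen back end."""
    return float(np.mean((mixed @ W @ W_asr - asr_target) ** 2))


def stage1_step(W, lr=0.5):
    residual = mixed @ W - clean
    grad = mixed.T @ (np.sign(residual) / residual.size)   # subgradient of the L1 loss
    return W - lr * grad


def stage2_step(W, lr=0.2):
    err = mixed @ W @ W_asr - asr_target
    grad_features = (2.0 * err / err.size) @ W_asr.T   # gradient passes through W_asr
    return W - lr * (mixed.T @ grad_features)          # only the front end is updated


l1_start = l1_loss(W_front)
for _ in range(300):                 # stage 1: physical alignment
    W_front = stage1_step(W_front)
l1_end = l1_loss(W_front)

sem_start = semantic_loss(W_front)
for _ in range(300):                 # stage 2: semantic alignment, back end frozen
    W_front = stage2_step(W_front)
sem_end = semantic_loss(W_front)
```

In a deep-learning framework the same effect would normally be obtained by disabling gradient updates for the ASR parameters and handing only the front-end parameters to the optimizer.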
[0026] During the architecture development process, pluggable features are integrated into the entire process of module design and training. Specific implementation requirements are as follows: The input / output interfaces of the front-end target feature extraction module are standardized: the dimensions and format of the output features are designed uniformly according to the input requirements of mainstream ASR models, supporting standardized access to ASRs with different architectures such as CTC and Encoder-Decoder. The front-end module and the back-end ASR module are designed without coupling: the training and inference of the front-end module do not depend on any internal structure, hierarchy, or parameters of the back-end ASR; data interaction is only conducted through standardized feature interfaces. After training, the front-end module is an independent deployment unit that can be directly detached from the currently cascaded ASR system and connected to other pre-trained ASR systems with matching input dimensions, without retraining, directly endowing the new ASR system with target speaker recognition capabilities.
[0027] For specific interview scenarios, a two-stage training architecture was built. The output dimension of the front-end target feature extraction module and the input dimension of the existing Wav2Vec2.0 (CTC architecture) ASR model in the interview system were both set to 80-dimensional Mel-spectral features, and the modules were directly connected in the feature domain. In the first stage, the front-end module was pre-trained to extract the interviewee's speech features from the mixed audio of the interview. In the second stage, all parameters of Wav2Vec2.0 were frozen, and the front-end module was collaboratively fine-tuned through CTCLoss to achieve semantic alignment. After training, the front-end module can be directly separated from Wav2Vec2.0 and connected to the interview system's backup Whisper (Encoder-Decoder architecture) ASR model (also with an 80-dimensional input dimension) without retraining, enabling Whisper to achieve target speech recognition of the interviewee.
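The plug-and-play condition in this example reduces to one check on the feature interface, plus the choice of semantic loss by back-end architecture. The sketch below is only illustrative: `AsrSpec`, `FrontEndSpec`, `can_plug`, and `semantic_loss_name` are hypothetical names introduced here, and the 80-dimensional values simply restate the figures given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrSpec:
    """Hypothetical description of a pre-trained ASR back end."""
    name: str
    architecture: str   # "ctc" or "encoder-decoder"
    input_dim: int      # feature dimension expected at the ASR input


@dataclass(frozen=True)
class FrontEndSpec:
    """Hypothetical description of the trained extraction module."""
    output_dim: int


def can_plug(front: FrontEndSpec, asr: AsrSpec) -> bool:
    """The only coupling is the feature interface, so the dimensions must match."""
    return front.output_dim == asr.input_dim


def semantic_loss_name(asr: AsrSpec) -> str:
    """Second-stage loss chosen by ASR architecture, as the text describes."""
    losses = {"ctc": "CTCLoss", "encoder-decoder": "cross-entropy"}
    if asr.architecture not in losses:
        raise ValueError(f"unsupported architecture: {asr.architecture}")
    return losses[asr.architecture]


front_end = FrontEndSpec(output_dim=80)
primary = AsrSpec("Wav2Vec2.0", "ctc", input_dim=80)
backup = AsrSpec("Whisper", "encoder-decoder", input_dim=80)
```

With these values, `can_plug(front_end, primary)` and `can_plug(front_end, backup)` both hold, which is the condition the paragraph relies on when the module is moved from one back end to the other.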
[0028] S102: Perform feature preprocessing on the mixed audio, target reference audio, and clean target audio, converting them uniformly into the Mel spectral feature format specified by the downstream ASR. A fixed-dimensional global timbre embedding vector is extracted through an adaptive voiceprint network of multi-layer convolution + RMSNorm + MeanPooling to generate feature preprocessing and voiceprint embedding information adapted to the ASR.
[0029] In one implementation, two core mechanisms are introduced to address the core adaptation requirement of direct feature domain connection in multi-speaker scenarios (i.e., front-end feature extraction does not require waveform reconstruction and can directly interface with ASR input). First, for Mel-spectrum feature conversion rules, a unified standard for audio feature conversion is defined to ensure that the converted feature format fully matches the input requirements of the downstream ASR model, avoiding adaptation problems caused by incompatibility in feature dimensions or formats. Second, for the adaptive speaker feature extraction mechanism, a timbre feature extraction logic is established for the target reference audio, ensuring accurate capture of globally representative target speaker timbre information from the reference audio, providing a basis for selecting target features in subsequent mixed audio.
[0030] Based on the above mechanism, a standardized preprocessing scheme is determined for mixed audio, target reference audio, and clean target audio. The core requirements of the scheme are "three unifications": unified feature conversion format (Mel spectrum), unified feature dimension (consistent with the downstream ASR input dimension), and unified processing flow (to avoid feature deviations caused by different audio types). In a specific interview scenario, the downstream ASR model is Whisper (Encoder-Decoder architecture), and its input feature requirement is an 80-dimensional Mel spectrum. Based on this, the preprocessing scheme clarifies that all three types of audio must be converted to 80-dimensional Mel spectrum features, and the voiceprint extraction of the target reference audio must adapt to this feature format to ensure that the extracted voiceprint vector can be effectively fused with the 80-dimensional features of the subsequent mixed audio.
[0031] Feature preprocessing is performed on the mixed audio, target reference audio, and clean target audio, converting all three types of audio into the Mel spectral feature format specified by the downstream ASR model. The source and purpose of the three types of input audio are clearly defined to ensure accurate preprocessing. Specifically, the mixed audio contains overlapping speech from the target speaker and at least one non-target speaker, and may also contain real-world environmental noise (such as background noise in a conference room or outdoor noise) depending on the actual scenario. This is the core processing object for subsequent target feature extraction. The target reference audio consists of independent speech segments of the target speaker extracted from the dataset, without other speakers' voices mixed in, and is used to extract the target speaker's unique voiceprint features. The clean target audio is the pure audio track of the target speaker in the mixed audio, without non-target speaker voices or noise; it is used for the subsequent physical alignment of features and serves as the ground-truth label.
[0032] The Mel spectrum conversion algorithm is used to perform feature conversion on the three types of audio, ultimately outputting Mel spectrum features in a unified format. During the conversion process, the requirements of the downstream ASR model for features are strictly followed to ensure that parameters such as sampling frequency, frame length, and frame shift are consistent with the ASR input standard, avoiding performance loss due to parameter differences. If the downstream ASR model requires a sampling frequency of 16 kHz, a frame length of 25 ms, and a frame shift of 10 ms, then all three types of audio are preprocessed according to these parameters. In this example, the mixed audio is an audio segment containing the target speaker, two non-target speakers, and office background noise; the target reference audio is a few tens of seconds of pure speech read by the target speaker alone; and the clean target audio is the pure audio track of the target speaker separated from the mixed audio. After conversion, all three yield an 80-dimensional Mel spectrum feature matrix.
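As a quick check of the example parameters, the frame length and frame shift can be converted into sample counts and a frame count. The helper names below are hypothetical, and the frame count assumes the signal is padded so that every hop yields one frame; other padding conventions differ by a few frames.

```python
def frame_params(sample_rate_hz: int, frame_ms: float, shift_ms: float):
    """Convert frame length and frame shift from milliseconds to samples."""
    window = round(sample_rate_hz * frame_ms / 1000)
    hop = round(sample_rate_hz * shift_ms / 1000)
    return window, hop


def num_frames(duration_s: float, sample_rate_hz: int, shift_ms: float) -> int:
    """Frame count under the assumed convention of one frame per hop."""
    n_samples = round(duration_s * sample_rate_hz)
    hop = round(sample_rate_hz * shift_ms / 1000)
    return n_samples // hop


window, hop = frame_params(16000, 25, 10)   # 400-sample window, 160-sample hop
frames_30s = num_frames(30, 16000, 10)      # 3000 frames for 30 s of audio
```

Under this convention a 30-second clip gives a 3000 × 80 Mel feature matrix.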
[0033] An adaptive voiceprint feature extraction network was constructed, consisting of a multi-layer convolutional encoder, RMSnorm root mean square normalization, MeanPooling temporal averaging aggregation, and a linear projection layer, with a fixed hidden layer mapping dimension. A four-layer core structure was also constructed, with each layer connected sequentially according to the logic of "input → encoding → normalization → aggregation → projection," as shown in the following structure:
[0034] For the input layer, the normalized target reference audio Mel spectrum features are received, and the input dimension is consistent with the Mel spectrum feature dimension (e.g., 80 dimensions).
[0035] For the multi-layer convolutional encoder, which is the core encoding module of the network, it consists of multiple convolutional layers. It captures local timbre features and global texture information in the Mel spectrum of the reference audio through convolutional operations. The kernel size and stride of each convolutional layer are set according to the feature extraction requirements to ensure that high-level timbre features can be extracted step by step.
[0036] For the RMSNorm root mean square normalization layer, the features after convolutional encoding are normalized to eliminate the training instability caused by differences in feature distribution and improve the network's accuracy in extracting timbre features.
[0037] For the MeanPooling time-axis average aggregation layer, the normalized features are averaged and aggregated along the time axis to compress the time dimension information, retain the globally representative timbre features, and generate a dimensionally compressed feature vector.
[0038] For the linear projection layer, which serves as the network's output layer, the aggregated feature vector is mapped to a preset fixed hidden dimension, thus standardizing the feature dimension. The hidden mapping dimension is fixed (e.g., 512 dimensions); it is the network's core output parameter and ensures that the generated voiceprint embedding vectors have a uniform dimension, facilitating subsequent fusion with mixed audio features. For the convolutional layer parameters, the number of convolutional kernels, kernel size, and stride of each convolutional encoder layer are configured according to the timbre feature extraction requirements, ensuring the gradual capture of timbre features from low to high levels. When no specific parameters are specified, industry-standard settings are used (e.g., an initial number of convolutional kernels of 64, a kernel size of 3×3, and a stride of 1).
[0039] The constructed adaptive voiceprint network follows a structure of "input layer (80-dimensional) → 5-layer 1D convolutional encoder → RMSNorm layer → MeanPooling layer → linear projection layer (512-dimensional)". The core parameters of the 5-layer 1D convolutional encoder are uniformly configured as kernel_size=7 and stride=1. The number of convolutional kernels is gradually increased according to feature extraction requirements to enhance the high-level timbre feature capture capability. The hidden layer dimension of each layer is set to 512, with an intermediate extended dimension of 2048. Adapting the extended dimension to the core dimension enhances the network's feature representation capability. After processing by the 5-layer 1D convolutional encoder, the features are normalized by RMSNorm, averaged along the time axis by MeanPooling, and finally mapped to the 512-dimensional hidden layer through the linear projection layer. This ensures that the generated global voiceprint embedding vector has a uniform dimension and possesses accurate target timbre representation capabilities.
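A minimal NumPy forward pass with this layer order might look as follows. It is a sketch under stated assumptions, not the network defined by the patent: the 'same' zero padding, the ReLU between convolutions, and the constant channel width are choices made here for illustration, the 2048-dimensional intermediate expansion is omitted because the text does not say where it sits, and the weights are random rather than trained.

```python
import numpy as np


def conv1d_same(x, weight, bias):
    """Stride-1 1D convolution over time with 'same' zero padding (assumed).

    x: (T, C_in), weight: (K, C_in, C_out), bias: (C_out,). Returns (T, C_out).
    """
    k = weight.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # windows[t, c, j] == xp[t + j, c]; the window axis is appended last
    windows = np.lib.stride_tricks.sliding_window_view(xp, k, axis=0)
    return np.einsum("tck,kco->to", windows, weight) + bias


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over the channel axis: x / sqrt(mean(x^2) + eps) * gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain


def init_params(rng, n_mel=80, hidden=512, n_layers=5, kernel_size=7):
    """Random weights for the sketch; a real network would be trained."""
    convs, c_in = [], n_mel
    for _ in range(n_layers):
        scale = (kernel_size * c_in) ** -0.5
        convs.append((rng.normal(scale=scale, size=(kernel_size, c_in, hidden)),
                      np.zeros(hidden)))
        c_in = hidden
    return {
        "convs": convs,
        "gain": np.ones(hidden),
        "proj_w": rng.normal(scale=hidden ** -0.5, size=(hidden, hidden)),
        "proj_b": np.zeros(hidden),
    }


def speaker_embedding(mel, params):
    """mel: (T, n_mel) reference Mel features -> (hidden,) global embedding."""
    h = mel
    for w, b in params["convs"]:
        h = np.maximum(conv1d_same(h, w, b), 0.0)    # ReLU is an assumption
    h = rms_norm(h, params["gain"])                  # normalize each frame
    pooled = h.mean(axis=0)                          # MeanPooling over the time axis
    return pooled @ params["proj_w"] + params["proj_b"]   # linear projection
```

Because the time axis is averaged away before the projection, `speaker_embedding` returns a vector of the same length for a 50-frame and a 3000-frame reference clip. For a quick trial, a smaller width such as `hidden=32` keeps the pure-NumPy convolution fast.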
[0040] The standardized Mel-spectral features of the target reference audio (e.g., an 80-dimensional feature matrix) are input into the adaptive voiceprint feature extraction network. The temporal dimension of this feature strictly corresponds to the duration of the reference audio (e.g., 30 seconds of audio corresponds to 3000 frames of features). After being processed by multiple layers of convolutional encoding, normalization, and aggregation, the input features are mapped to a preset hidden layer dimension through a linear projection layer, ultimately generating a global voiceprint embedding vector.
[0041] The input features are processed sequentially by a multi-layer convolutional encoder. Each convolutional layer extracts local timbre features through convolution operations, progressively generating a high-level abstract timbre feature vector. The encoded features are then fed into an RMSnorm layer, where they are normalized using the root mean square normalization formula to ensure a stable feature distribution. The normalized features are then fed into a MeanPooling layer, where the features of all frames are averaged along the time axis to obtain a global feature vector that does not contain the time dimension. The aggregated global features are then fed into a linear projection layer, where a linear transformation maps the feature dimensions to a preset hidden layer dimension (e.g., 512 dimensions), completing the network processing.
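The text refers to the root mean square normalization formula without stating it. In the commonly used formulation, which is assumed here, with $h_t \in \mathbb{R}^{D}$ the encoded feature of frame $t$, $g$ a learned gain vector, and $\epsilon$ a small stabilizing constant (both $g$ and $\epsilon$ are assumptions, as the source does not name them), the normalization, time-axis averaging, and projection steps can be written as:

```latex
\begin{aligned}
\tilde{h}_t &= \frac{h_t}{\sqrt{\frac{1}{D}\sum_{d=1}^{D} h_{t,d}^{2} + \epsilon}} \odot g,
  && t = 1,\dots,T \\
\bar{h} &= \frac{1}{T}\sum_{t=1}^{T} \tilde{h}_t \\
e &= W\,\bar{h} + b, && e \in \mathbb{R}^{512}
\end{aligned}
```

Since $\bar{h}$ no longer carries a time index, the dimension of the embedding $e$ is independent of the number of frames $T$.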
[0042] A fixed-dimensional global voiceprint embedding vector is generated from the feature processing results of the target reference audio. The output vector of the linear projection layer is taken directly from the network processing results above; this vector is the fixed-dimensional global voiceprint embedding vector. It is the core representation of the target speaker's timbre and contains the speaker's unique timbre information. Its dimension is consistent with the preset hidden layer mapping dimension (e.g., 512 dimensions) and does not change with the duration of the input audio, so it is globally representative. For example, when the target reference audio is a few tens of seconds of pure speech from the target speaker, network processing yields a 512-dimensional global voiceprint embedding vector. This vector can accurately distinguish the timbre of the target speaker from that of other speakers, and can subsequently be used to filter the target speaker's features out of the mixed audio.
[0043] The Mel-spectral normalized features of the multiple audio types are integrated with the target global voiceprint embedding vector to generate feature preprocessing and voiceprint embedding information adapted to the input requirements of the downstream ASR model. Specifically, the Mel-spectral normalized features of the three audio types (80-dimensional Mel spectra of the mixed audio, target reference audio, and clean target audio) are integrated with the 512-dimensional global voiceprint embedding vector extracted above to form a unified feature set. This set must meet the input requirements of the downstream ASR model and contains both the original feature information of each audio type and the timbre representation of the target speaker. It can then be used directly as input to the multi-layer adaptive convolutional coding extraction network, providing data support for the modulation and fusion of target features.
[0044] The integrated feature set comprises the 80-dimensional Mel spectrum feature matrix of the mixed audio, the 80-dimensional Mel spectrum feature matrix of the target reference audio, the 80-dimensional Mel spectrum feature matrix of the clean target audio, and the 512-dimensional global speaker embedding vector. This set requires no additional format conversion and can be input directly into the subsequent target extraction network, where the timbre vector is fused with and modulates the mixed audio features.
[0045] S103 constructs a multi-layer adaptive convolutional coding extraction network. The timbre vector is copied along the time axis and used by AdaLayer to modulate the mixed audio features layer by layer, suppressing non-target components. The output projection layer then restores the result to the original Mel spectrum dimension. Physical alignment training is completed with an L1 loss, generating feature modulation, fusion, and restoration information with basic target extraction capability.
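A minimal sketch of one modulation step and the L1 objective follows. The text does not define AdaLayer's internals, so the sketch assumes a scale-and-shift form in which γ and β are linear functions of the tiled timbre vector; the weights, sizes, and function names are illustrative assumptions only.

```python
def tile_along_time(timbre, num_frames):
    """Copy the global timbre vector once per frame (time-axis broadcast)."""
    return [list(timbre) for _ in range(num_frames)]

def matvec(weight, vec):
    """Multiply an (out_dim x in_dim) matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def ada_layer(frames, timbre_frames, w_gamma, w_beta):
    """Assumed modulation: out = (1 + W_gamma c) * h + W_beta c, applied per frame."""
    out = []
    for h, c in zip(frames, timbre_frames):
        gamma = [1.0 + g for g in matvec(w_gamma, c)]
        beta = matvec(w_beta, c)
        out.append([g * x + b for g, x, b in zip(gamma, h, beta)])
    return out

def l1_loss(pred, target):
    """Mean absolute difference over all frames and feature bins."""
    diffs = [abs(p - t) for pf, tf in zip(pred, target) for p, t in zip(pf, tf)]
    return sum(diffs) / len(diffs)

# Toy data: 3 frames, 2 feature bins; bin 0 carries interference, bin 1 is clean.
mixed = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
clean = [[0.5, 2.0], [1.5, 4.0], [2.5, 6.0]]
timbre = [1.0, 0.0]
w_gamma = [[-0.5, 0.0], [0.0, 0.0]]  # yields gamma = [0.5, 1.0] for this timbre
w_beta = [[0.0, 0.0], [0.0, 0.0]]

modulated = ada_layer(mixed, tile_along_time(timbre, len(mixed)), w_gamma, w_beta)
loss_before = l1_loss(mixed, clean)
loss_after = l1_loss(modulated, clean)
print(loss_before, loss_after)
```

In first-stage training the weights would be learned by minimizing this L1 loss against the clean target Mel spectrum; here they are hand-set so the effect of the modulation is visible.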
[0046] In one implementation, the feature domain direct connection adaptation requirement of target speech recognition in multi-speaker scenarios (front-end features connect directly to the ASR without waveform conversion) is combined with the two-stage training logic (physical alignment in the first stage, semantic alignment in the second stage). On this basis, a back-end ASR parameter freezing mechanism and an architecture-specific loss adaptation rule are introduced to build a collaborative fine-tuning system between the extraction module and the pre-trained ASR. The two core mechanisms are as follows:
[0047] Regarding the backend ASR parameter freezing mechanism, the rule of not updating the ASR model parameters throughout the process is clearly defined to ensure that the general recognition and language modeling capabilities obtained through pre-training are not destroyed and to avoid "catastrophic forgetting".
[0048] For the architecture-specific loss adaptation rule, a loss function matching standard is established for each ASR architecture, so that the semantic loss accurately reflects the gap between the extracted features and the semantic requirements of the ASR.
[0049] The collaborative fine-tuning system centers on a front-end target feature extraction module and a back-end frozen ASR module, cascaded by direct connection in the feature domain. Its core objective is for the front-end extraction module to learn the "semantic preferences" of the ASR, so that the extracted target features align semantically with the ASR requirements. For example, in interactive digital human scenarios, the pre-trained ASR on the digital human is either Wav2Vec2.0 (CTC architecture) or Whisper (Encoder-Decoder architecture), and it needs to recognize only the voice commands of specific users. The collaborative fine-tuning system therefore introduces the ASR parameter freezing mechanism (locking all parameters of Wav2Vec2.0 or Whisper) and the architecture-specific loss adaptation rule (CTCLoss for Wav2Vec2.0, cross-entropy loss for Whisper) to adapt the system to the semantic recognition needs of digital human scenarios.
[0050] Following the cascading specification of direct feature domain connection, the output of the target feature extraction module trained in the first stage is connected directly to the input of the pre-trained ASR model, completing the end-to-end cascading adaptation between the extraction module and the ASR model. The core requirement of this specification is "no intermediate transformation and strict dimensional alignment": no waveform reconstruction or additional feature transformation step is inserted between the two, which keeps feature transfer efficient and consistent.
[0051] Taking Wav2Vec2.0 (CTC architecture) as the ASR model, the target feature extraction module trained in the first stage (which already has basic target feature extraction capability) serves as the front end and is cascaded end-to-end with the pre-trained ASR model. At the connection point, the dimension and time axis length of the front-end output features must exactly match the ASR input requirements, to avoid semantic recognition deviations caused by dimension mismatch.
[0052] The front-end extraction module, which has completed the first stage of training, outputs 80-dimensional Mel spectrum features (consistent with the input dimension of the downstream ASR). Its output is directly connected to the input of Wav2Vec2.0 (CTC architecture) without any intermediate conversion modules, realizing seamless cascading of feature domains and ensuring that the extracted target features can be directly received and processed by the ASR.
[0053] Based on the core requirement of task consistency and collaborative fine-tuning, two major parameter constraint rules are defined. First, all weight parameters of the pre-trained ASR model are frozen, disallowing any parameter updates during training. In this state, the ASR module acts solely as a "static semantic discriminator," outputting text prediction results and providing a semantic loss benchmark. Second, during training, gradients are limited to flowing only within the front-end target feature extraction module and are not propagated back to the ASR module, ensuring that the original general capabilities of the ASR are unaffected by fine-tuning.
[0054] By constraining parameters, the front-end extraction module becomes the sole object of parameter updates, forcing it to learn how to adjust its feature extraction logic and reduce feature distortions that lead to mishearing in ASR. This ensures that the extracted target features better match the semantic recognition requirements of ASR. The cascaded ASR is Wav2Vec2.0 (CTC architecture), with parameters of all its network layers (including convolutional layers, fully connected layers, and CTC output layers) frozen. During training, only the parameters of the convolutional coding layer and AdaLayer modulation layer of the front-end extraction module are allowed to be updated, and gradients are only backpropagated within the front-end module, ensuring that the general recognition capability of Wav2Vec2.0 remains unaffected.
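The parameter constraint can be illustrated with a deliberately tiny sketch: a one-parameter "front end" feeding a one-parameter "frozen ASR", both hypothetical stand-ins for the real networks. The loss is computed at the output of the frozen stage, and the update rule writes only to the front-end parameter. In this toy the loss signal necessarily passes through the frozen weight to reach the front end, but the frozen weight itself is never changed.

```python
def mean_sq_error(front_gain, asr_weight, samples):
    """Loss measured at the output of the frozen back end."""
    return sum((asr_weight * front_gain * x - t) ** 2 for x, t in samples) / len(samples)

def train_front_end(front_gain, asr_weight, samples, lr=0.02, steps=100):
    """Gradient descent on front_gain only; asr_weight is read but never written."""
    for _ in range(steps):
        grad = 0.0
        for x, target in samples:
            pred = asr_weight * (front_gain * x)
            # Chain rule through the frozen stage: d(loss)/d(front_gain).
            grad += 2.0 * (pred - target) * asr_weight * x
        front_gain -= lr * grad / len(samples)
    return front_gain

ASR_WEIGHT = 2.0  # frozen "pre-trained" parameter (toy assumption)
samples = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # target equals input, so the ideal gain is 0.5

loss_before = mean_sq_error(1.0, ASR_WEIGHT, samples)
tuned_gain = train_front_end(1.0, ASR_WEIGHT, samples)
loss_after = mean_sq_error(tuned_gain, ASR_WEIGHT, samples)
print(tuned_gain, loss_before, loss_after)
```

The front end adapts to whatever the frozen stage expects rather than the other way round, which is the sense in which the ASR acts as a static discriminator.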
[0055] The mixed audio and target reference audio are input into the cascaded system. After the front-end extraction module outputs the target speaker features, they are fed into the frozen ASR model to generate text prediction results. For different ASR architectures such as CTC and Encoder-Decoder, CTCLoss and cross-entropy loss are used to calculate semantic loss respectively. Two types of core data are input into the collaborative fine-tuning system, as follows: the mixed audio includes the speech of the target speaker and non-target speakers, and can be mixed with real-world noise (such as background noise in a conference room) to simulate real-world scenarios with multiple speakers. The target reference audio consists of independent speech segments of the target speaker, used to assist the front-end module in accurately locating target features.
[0056] After the input data is processed by the front-end target feature extraction module, the target speaker features (80-dimensional Mel spectrum format) are output. These features are directly fed into the frozen ASR module. The ASR module performs semantic recognition based on the pre-trained parameters and outputs the text prediction result corresponding to the target speaker.
[0057] The loss function used to calculate the semantic loss is selected according to the ASR architecture type: CTCLoss if the ASR is a CTC architecture (such as Wav2Vec2.0), and cross-entropy loss if the ASR is an Encoder-Decoder architecture (such as Whisper). For example, with Whisper (Encoder-Decoder architecture) as the ASR, the input mixed audio is "target speaker's instructions + chatter from two non-target speakers + office noise", and the reference audio is a single instruction segment recorded by the target speaker. After the front-end module outputs the target features, Whisper outputs the text prediction result. The cross-entropy loss between this prediction and the actual instruction text gives the semantic loss value.
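The selection rule can be sketched as a small lookup, together with a token-level cross-entropy for the Encoder-Decoder case. The architecture keys, function names, and the toy probabilities below are assumptions for illustration; a real system would take the per-token distributions from the frozen ASR decoder, and the CTC branch would call an existing CTC loss implementation rather than the placeholder name shown here.

```python
import math

# Hypothetical mapping from architecture type to the loss named in the text.
LOSS_BY_ARCHITECTURE = {
    "ctc": "CTCLoss",
    "encoder_decoder": "cross_entropy",
}

def select_semantic_loss(architecture):
    """Return the loss name for an ASR architecture, or fail loudly if unknown."""
    try:
        return LOSS_BY_ARCHITECTURE[architecture]
    except KeyError:
        raise ValueError(f"unsupported ASR architecture: {architecture!r}") from None

def cross_entropy(token_probs, target_ids):
    """Mean negative log-probability assigned to the reference tokens."""
    total = sum(math.log(probs[t]) for probs, t in zip(token_probs, target_ids))
    return -total / len(target_ids)

# Toy decoder output: two steps over a two-token vocabulary.
token_probs = [[0.5, 0.5], [0.25, 0.75]]
target_ids = [0, 1]
semantic_loss = cross_entropy(token_probs, target_ids)
print(select_semantic_loss("encoder_decoder"), round(semantic_loss, 4))
```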
[0058] By backpropagating the semantic loss gradient, only the model parameters of the front-end extraction module are updated, generating semantic loss feedback and parameter optimization information that only requires minor fine-tuning of the front end, thus achieving semantic alignment between the extracted features and the ASR model. The semantic loss value calculated in the fourth step is used as an optimization signal and backpropagated along the network layers of the front-end target feature extraction module, updating only the model parameters of this module, including the convolution kernel parameters of the multi-layer convolutional coding layers, the learnable γ and β parameters of the AdaLayer modulation layer, and the dimension mapping parameters of the output projection layer.
[0059] Through multiple rounds of iterative training, the front-end extraction module continuously adjusts its feature extraction logic, gradually reducing the feature components that cause ASR semantic recognition bias. The extracted target features thus become consistent with the clean target features at the physical level (spectral format) and aligned with the ASR recognition requirements at the semantic level (such as keywords and grammatical structures), making the task objectives of feature extraction and ASR semantic recognition consistent. For example, after 100 rounds of iterative training, the target features extracted by the front-end module are fed into the frozen Wav2Vec2.0 (CTC architecture). The word error rate of the output text predictions, measured against the real text, is significantly reduced, indicating that the extracted features capture the semantic information required by the ASR and achieve semantic-level alignment.
[0060] S104 connects the extraction module directly to the pre-trained ASR in the feature domain, freezes all ASR parameters so that gradients update only the front end, inputs mixed and reference audio to obtain target features, selects CTCLoss for CTC architectures and cross-entropy loss for Encoder-Decoder architectures, and generates semantic loss feedback and parameter optimization information that updates only the front end.
[0061] In one implementation, the pluggable adaptation requirement of target speech recognition in multi-speaker scenarios (the module can be integrated with different ASRs without modification) is combined with the direct feature domain connection architecture logic (no waveform reconstruction, direct feature docking) to optimize the adaptability of the target feature extraction module. Two core mechanisms are introduced. Strict dimensional alignment rules establish a precise matching standard between the module's output dimension and the ASR's input dimension, ensuring no deviation in feature dimensions and avoiding dimensional conflicts during adaptation. A lightweight fine-tuning mechanism specifies that the module needs only a few parameter updates to adapt to different ASRs, keeping the module lightweight and reducing the computational overhead of adaptation.
[0062] Through this optimization, the target feature extraction module achieves "dimensional compatibility, architecture independence, and low computational cost", adapting seamlessly to different ASR architectures such as CTC and Encoder-Decoder without altering the ASR model structure or parameters. For example, in conference audio recording scenarios, the conference system needs to work with both Wav2Vec2.0 (CTC architecture) and Whisper (Encoder-Decoder architecture). The adaptation optimization design therefore introduces strict dimensional alignment rules (unifying the module output dimension to 80 dimensions, consistent with the input dimensions of both ASRs) and a lightweight fine-tuning mechanism (fine-tuning only the AdaLayer modulation layer parameters of the module), ensuring seamless switching between the two ASRs.
[0063] Based on the input dimension requirements of mainstream downstream ASR models (e.g., Mel spectral features typically have 80 dimensions), the output dimension of the front-end target feature extraction module is precisely aligned with the ASR input dimension. The core standard for dimension alignment is that the dimension and time axis resolution of the module's output features must be completely consistent with the ASR input requirements, so that the ASR can receive and process the features directly without additional dimension transformation.
[0064] The output layer of the front-end target feature extraction module undergoes dimensional calibration. By adjusting the parameters of the output projection layer, the module's output feature dimensions are mapped to values that perfectly match the ASR input dimensions. During the alignment process, it is necessary to simultaneously ensure that the time axis length of the features is consistent with the time axis requirements of the ASR input to avoid recognition errors caused by time dimension misalignment.
[0065] The downstream ASR model is Wav2Vec2.0 (CTC architecture), which requires 80-dimensional Mel spectrum as input features. The parameters of the output projection layer of the target feature extraction module are adjusted to reduce the original 128-dimensional output features of the module to 80 dimensions through linear mapping, while maintaining the consistency between the features of each frame on the time axis and the input frame length requirement of Wav2Vec2.0, thus achieving precise dimensional alignment.
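The 128-to-80 reduction in this example can be sketched as a per-frame linear map. The random weights and the frame count of 5 are placeholders chosen for illustration; in the described system the projection weights would be learned.

```python
import random

def project_frames(frames, weight):
    """Apply one (out_dim x in_dim) linear map to every frame; the frame count is unchanged."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight] for frame in frames]

random.seed(0)
IN_DIM, OUT_DIM, NUM_FRAMES = 128, 80, 5  # 128 and 80 follow the text; 5 frames is a toy value
weight = [[random.gauss(0, 0.05) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
hidden = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(NUM_FRAMES)]

mel_like = project_frames(hidden, weight)
print(len(mel_like), len(mel_like[0]))
```

Because the map acts on each frame independently, the time axis length of the output equals that of the input, which is the second half of the alignment requirement.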
[0066] Based on the design principle of direct feature domain connectivity, the extraction module undergoes architectural decoupling optimization so that it remains independently pluggable and does not depend on the internal structure of any specific ASR model; this completes the core feature construction of the pluggable architecture. The internal structure of the module (such as the convolutional coding layers and AdaLayer modulation layers) does not rely on the internal architecture of any specific ASR (such as the transcription layer of CTC or the attention layer of the Encoder-Decoder), and the module interfaces with the ASR only through standardized feature interfaces.
[0067] The module is clearly defined as having independent deployment capabilities. As an independent front-end adapter, it possesses complete target feature extraction functionality and can run without relying on any auxiliary modules of ASR. The module's input / output interfaces are standardized to ensure it can be directly connected to or disconnected from the ASR system, achieving "plug and play."
[0068] The target feature extraction module's internal structure consists of multiple adaptive convolutional coding layers and AdaLayer modulation layers, without including any Whisper (Encoder-Decoder architecture) specific attention mechanism or Wav2Vec2.0 (CTC architecture) specific transcription layer structure. The module interfaces with both ASRs via a standardized 80-dimensional Mel spectral feature interface, allowing it to be directly decoupled from the Whisper system and integrated into the Wav2Vec2.0 system without any structural modifications.
[0069] In line with the training logic of fine-tuning only the front end, the parameter size of the extraction module is kept lightweight. This is the core objective of computing power optimization: when adapting to a different ASR, semantic alignment can be completed by adjusting only a small number of parameters, without a full update of the module parameters, which reduces the computing power consumed during adaptation.
[0070] The overall parameter size of the module is designed to be lightweight (smaller than the parameter size of a pre-trained ASR). The core functional layers (such as the AdaLayer modulation layer) contain only a small number of learnable parameters, reducing the computational cost during fine-tuning. When adapting to different ASRs, only the parameters of the core functional layers of the module (such as the convolution kernel parameters of the convolutional coding layer, and the γ and β parameters of the AdaLayer) are fine-tuned, while the parameters of the remaining layers remain unchanged, further reducing computational overhead.
[0071] The target feature extraction module has a total parameter size of 5 million, which is much smaller than Whisper's 1.1 billion parameters. When adapting to Wav2Vec2.0, only 200,000 learnable parameters of the AdaLayer modulation layer in the module are fine-tuned, while the remaining 4.8 million parameters remain fixed. The computational cost of the fine-tuning process is only 4% of that of full fine-tuning, achieving efficient computational optimization.
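The figures in this example can be checked with a few lines of arithmetic. The snippet verifies only the parameter counts; treating the 4% parameter ratio as the fine-tuning cost ratio is the text's own framing and is not derived here.

```python
TOTAL_PARAMS = 5_000_000   # whole extraction module
TUNED_PARAMS = 200_000     # AdaLayer modulation layer parameters that are fine-tuned

frozen_params = TOTAL_PARAMS - TUNED_PARAMS
tuned_ratio = TUNED_PARAMS / TOTAL_PARAMS

print(frozen_params)            # parameters held fixed during adaptation
print(f"{tuned_ratio:.0%}")     # share of the module that is updated
```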
[0072] The dimension alignment results, architectural decoupling characteristics, and computational optimization parameters are integrated to generate feature dimension matching and lightweight pluggable adaptation information, which enables seamless adaptation of the extraction module to various mainstream ASR models. This information contains three core parts: output dimension matching information, giving the correspondence between the module's output dimension and the input dimension of mainstream ASRs (e.g., 80-dimensional Mel spectrum); architectural decoupling independence information, including the design specification that the module does not depend on the internal structure of any specific ASR; and lightweight fine-tuning specifications, giving the range of fine-tuned layers and the parameter update ratio when adapting to different ASRs.
[0073] Based on the generated pluggable adaptation information, the trained target feature extraction module is integrated into any pre-trained ASR system with a matching input dimension. Integration requires no modification to the ASR's structure or parameters: connecting the module's output directly to the ASR's input gives the ASR target speaker recognition capability. For example, the integrated pluggable adaptation information states that the module output is an 80-dimensional Mel spectrum, that the module does not depend on a specific ASR architecture, and that only the AdaLayer parameters are fine-tuned during adaptation. Based on this information, the module can be integrated directly into Whisper (Encoder-Decoder architecture) and Wav2Vec2.0 (CTC architecture) systems without any structural modification, and both ASRs can then accurately recognize the target speaker's voice in mixed audio.
[0074] S105 aligns the output dimension of the extraction module with the input dimension of ASR, generating feature dimension matching and lightweight, pluggable adaptation information that can seamlessly adapt to mainstream ASR.
[0075] In one implementation, based on the two-stage training task consistency collaborative fine-tuning mechanism (physical alignment in the first stage, semantic alignment in the second stage) and the requirements of multi-speaker scene recognition (accurate extraction of target semantics, elimination of non-target interference), a pre-trained ASR model with frozen parameters is used as a static semantic discriminator to generate semantic evaluation benchmark information adapted to the feature extraction module. The core role of the semantic evaluation benchmark is to provide a stable and unified semantic judgment standard for the front-end target feature extraction module, ensuring that the extracted features meet the semantic recognition requirements of the ASR.
[0076] The pre-trained ASR model with frozen parameters is used as a static semantic discriminator: the ASR model parameters are not updated throughout the process, retaining its general recognition and language modeling capabilities obtained in large-scale pre-training. It only outputs text prediction results to provide a semantic evaluation basis for the feature quality of the extraction module, avoiding semantic alignment deviation caused by fluctuations in the discrimination benchmark.
[0077] In specific interview scenarios, a pre-trained Whisper (Encoder-Decoder architecture) is selected as a static semantic discriminator, with all its parameters frozen. After receiving the target features output by the front-end module, the discriminator outputs the corresponding text prediction result. This result serves as a benchmark for evaluating the semantic effectiveness of the extracted features. If the predicted text is close to the interviewee's actual speech text, it indicates that the extracted features meet the semantic requirements.
[0078] Based on the design goals of low computational cost (fine-tuning only the front-end module, with no additional computational consumption) and high recognition rate (reducing the word error rate under overlapping speech), semantic optimization is performed on the output features of collaborative fine-tuning. The alignment requirements between the extracted features and the semantic preferences of the ASR model are defined, and semantic adaptation criteria information for the target speech features is generated. The core is to ensure that the extracted features match the semantic preferences of the ASR model, reducing feature distortions that lead to mishearing in ASR.
[0079] The alignment requirements between the extracted features and the semantic preferences of the ASR model are defined to form the semantic adaptation criteria: the features must contain clear keywords, grammatical structures, and other semantic information so that the ASR can accurately parse the meaning of the text; the features must filter out non-target semantic interference (such as the speech semantics of non-target speakers and semantic confusion caused by environmental noise); and the semantic expression of the features must be consistent with the recognition habits of the ASR (for example, a CTC architecture ASR prefers continuous phoneme-level semantic features, while an Encoder-Decoder architecture ASR prefers complete sentence-level semantic features).
[0080] The established adaptation criteria for the semantic preferences of Wav2Vec2.0 (CTC architecture) are as follows: the extracted features should highlight continuous semantic information at the phoneme level, filter out phoneme interference from non-target speakers, and the temporal resolution of the features should be consistent with the input requirements of Wav2Vec2.0 to ensure that ASR can efficiently decode semantics. At the same time, only the AdaLayer parameters of the front-end module are fine-tuned to control the computational cost.
[0081] Leveraging the pluggable architecture of direct feature domain connection, no-modification access rules are established for adapting the extraction module to various mainstream ASRs, ensuring that during the inference phase the module integrates with any ASR system of the same input dimension at no additional adaptation cost. The core requirement of these rules is "zero-cost access and seamless adaptation": the extraction module integrates with different types of mainstream ASR systems during inference without modifying the ASR's structure, parameters, or interfaces.
[0082] The dimension and time axis length of the output features of the extraction module must be strictly consistent with the ASR input requirements (e.g., both are 80-dimensional Mel spectrum features), without the need for additional dimension conversion; the module output interface is designed according to the mainstream ASR input standard, without any dedicated interface dependency, and can be directly connected to the ASR input end; the module has complete target feature extraction function and can work independently without relying on any ASR auxiliary modules (such as feature preprocessing module, decoding module).
[0083] For example, the established no-modification access rules specify that the extraction module outputs 80-dimensional Mel spectrum features through an interface in a standardized feature tensor format. When adapting to Wav2Vec2.0 (CTC architecture), the module output is connected directly to the Wav2Vec2.0 input. Access requires no modification of the Wav2Vec2.0 network structure or parameters and no intermediate conversion module, so the access process incurs no additional adaptation cost.
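The dimension and time axis conditions of the access rules lend themselves to a simple pre-connection check. The sketch below is one possible way to express it; `FeatureSpec`, `check_direct_connection`, and the frame count of 300 are hypothetical names and values, not part of the described system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureSpec:
    """Shape contract of a feature tensor: bins per frame and number of frames."""
    feature_dim: int
    num_frames: int

def check_direct_connection(front_end_out, asr_in):
    """Return a list of mismatches; an empty list means the two can be connected directly."""
    problems = []
    if front_end_out.feature_dim != asr_in.feature_dim:
        problems.append(
            f"feature dimension {front_end_out.feature_dim} != {asr_in.feature_dim}"
        )
    if front_end_out.num_frames != asr_in.num_frames:
        problems.append(
            f"time axis length {front_end_out.num_frames} != {asr_in.num_frames}"
        )
    return problems

asr_input = FeatureSpec(feature_dim=80, num_frames=300)
aligned = check_direct_connection(FeatureSpec(80, 300), asr_input)
misaligned = check_direct_connection(FeatureSpec(128, 300), asr_input)
print(aligned, misaligned)
```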
[0084] The semantic evaluation benchmark information (the evaluation standards and methods of the static semantic discriminator), the semantic adaptation criteria information (the semantic alignment requirements for the extracted features), and the no-modification access rules (the interface specifications between the module and the ASR) generated in the above steps are integrated to generate target speech semantic recognition and accurate extraction information with low computational overhead and a high recognition rate. This information covers the three core links of discrimination benchmark, semantic alignment, and convenient access, forming a logical closed loop that can be implemented directly.
[0085] The integrated information comprises three core parts, as follows: First, it clarifies the selection of the static semantic discriminator (e.g., CTC / Encoder-Decoder architecture ASR), parameter freezing requirements, and semantic evaluation methods (measuring the semantic effectiveness of features by comparing the predicted text with the actual text). Second, it clarifies the semantic adaptation criteria, optimization direction, and fine-tuning range for extracted features (core parameters of the front-end module only), ensuring semantic-level alignment with low computational overhead. Third, it clarifies the rules for seamless integration between modules and ASR, including dimensional matching requirements, interface standards, and independent operation characteristics, supporting pluggable adaptation of modules.
[0086] For example, the integrated information selects Wav2Vec2.0 with frozen parameters as the static semantic discriminator. The semantic effectiveness of the features is evaluated by the word error rate of the text prediction results measured against the real text. Feature extraction should highlight phoneme-level semantic information and filter out non-target interference, with only the front-end AdaLayer parameters fine-tuned. The module outputs 80-dimensional Mel spectrum features through a standardized interface and connects directly to CTC architecture ASRs such as Wav2Vec2.0 without any modification, ultimately achieving high-accuracy target speech semantic recognition with low computational overhead.
[0087] S106, relying on the task consistency collaborative fine-tuning mechanism of frozen ASR, uses the ASR as a static semantic discriminator to generate target speech semantic recognition and accurate extraction information with low computing power and a high recognition rate.
[0088] In one implementation, based on the core requirements of target speech recognition in multi-speaker scenarios (resolving inconsistent task objectives, avoiding degradation of general capabilities, improving adaptation flexibility, and reducing computational overhead), four core technical elements are extracted: freezing the ASR parameters, semantic loss reverse guidance, direct connection of feature domains, and lightweight fine-tuning of only the front end. Each element corresponds to one of the core requirements, forming a closed loop of technical support.
[0089] Each element addresses one requirement. Freezing the ASR parameters locks the parameters of the pre-trained ASR model, addressing the "general capability degradation caused by full fine-tuning". Semantic loss reverse guidance optimizes the front-end extraction module by backpropagating the semantic loss, addressing the "inconsistent task objectives". Direct feature domain connection lets the front-end module interface directly with the ASR, eliminating waveform reconstruction and improving adaptation flexibility. Lightweight front-end fine-tuning updates only the front-end extraction module parameters, keeping the module lightweight and reducing computational overhead.
[0090] In interactive digital human scenarios, the core requirements are "recognizing only specific user commands + not affecting the digital human's original ASR capabilities + real-time response with low computing power". The four core technical elements extracted can accurately support this requirement: freezing the digital human's existing ASR parameters (preserving the original recognition capabilities), optimizing the front end through semantic loss (accurately extracting user commands), direct connection of feature domains (fast response), and fine-tuning only the front end (low computing power).
[0091] The extracted core technical elements undergo scenario-based validation to confirm their feasibility and logical closure in overlapping speech processing and multi-architecture ASR adaptation, generating validation parameters for each element. The validation checks each element in typical multi-speaker scenarios to avoid design flaws that could prevent the solution from being implemented. For overlapping speech processing scenarios, it checks whether the elements work together to achieve accurate extraction and semantic recognition of target features in overlapping speech; for example, whether "semantic loss reverse guidance" effectively guides the front-end module to filter out non-target semantic interference. For multi-architecture ASR adaptation scenarios, it checks whether the elements adapt to different ASR architectures such as CTC and Encoder-Decoder; for example, whether "direct feature domain connection" and "freezing ASR parameters" achieve seamless adaptation without changing the ASR structure.
[0092] By verifying the applicable boundaries and collaborative logic of each element, technical element verification parameters are generated, including: the recognition accuracy threshold for element collaboration under overlapping speech, the dimensional compatibility standard for multi-architecture ASR adaptation, and the upper limit of parameter update ratio for front-end fine-tuning, providing a basis for subsequent technology integration. In the scenario of overlapping speech in a conference, the feasibility of collaboration between "freezing ASR parameters + semantic loss reverse guidance" is verified: the parameters of the conference system Wav2Vec2.0 (CTC architecture) are frozen, and the front-end module is optimized through its semantic loss to verify whether the extracted target features can control the word error rate of ASR within a preset threshold, generating verification parameters of "word error rate ≤ 15% under overlapping speech"; when adapting to Whisper (Encoder-Decoder architecture), the dimensional compatibility of "direct connection of feature domains" is verified, generating verification parameters of "output dimension 80-dimensional Mel spectrum completely matches ASR input dimension".
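The "word error rate ≤ 15% under overlapping speech" verification parameter can be illustrated with a minimal sketch. It assumes the usual definition of word error rate (word-level edit distance divided by the number of reference words); the function names and the whitespace tokenization are illustrative assumptions, not part of the described method.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)


# Threshold taken from the verification parameter quoted in the text.
WER_THRESHOLD = 0.15


def passes_overlap_check(reference: str, hypothesis: str,
                         threshold: float = WER_THRESHOLD) -> bool:
    """True when the ASR transcript stays within the preset WER threshold."""
    return word_error_rate(reference, hypothesis) <= threshold
```

In practice the check would be averaged over a test set of overlapping-speech utterances rather than applied to a single transcript.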
[0093] The task consistency collaborative fine-tuning mechanism based on frozen ASR integrates and encapsulates the validated technical elements, solidifies the pre-trained ASR model into a static semantic discriminator, and generates a standardized collaborative fine-tuning framework. Based on the frozen ASR task consistency collaborative fine-tuning mechanism, the four validated core technical elements are integrated and encapsulated to construct a standardized collaborative fine-tuning framework. The core is to clarify the role positioning and collaborative process of each element, ensuring that the framework has universality and repeatability.
[0094] The pre-trained ASR model is solidified into a static semantic discriminator. Its core responsibilities are: to output text prediction results, provide semantic loss benchmarks, not participate in parameter updates, retain its general recognition and language modeling capabilities obtained from large-scale pre-training, and provide a stable semantic evaluation standard for the front-end module.
[0095] The framework is centered around a "static semantic discriminator (frozen ASR) + front-end target feature extraction module", and clarifies the interaction logic between the two: the front-end module receives mixed audio and reference audio and outputs target features; the discriminator receives target features, outputs text prediction results and calculates semantic loss; the loss gradient is only fed back to the front-end module to drive its parameter update, forming a collaborative closed loop of "extraction-discrimination-optimization".
[0096] The pre-trained Wav2Vec2.0 (CTC architecture) is solidified as a static semantic discriminator, and all its network layer parameters are frozen. The framework's collaborative logic is as follows: the front-end module processes "mixed speech of attendees + reference audio of the target speaker" and outputs target features; Wav2Vec2.0 receives the features, outputs the text prediction result, calculates CTCLoss, and backpropagates the gradient to the front-end module, updating only the parameters of the front-end's AdaLayer modulation layer and convolutional coding layer to achieve collaborative fine-tuning.
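The "extraction-discrimination-optimization" loop can be sketched with a deliberately tiny model. Everything here is a stand-in chosen for illustration: the front end and the frozen ASR are single scalar weights, and a squared error replaces the semantic loss (CTCLoss in the Wav2Vec2.0 example). The point is only that the gradient passes through the frozen discriminator while the update touches the front end alone.

```python
from dataclasses import dataclass


@dataclass
class ToySystem:
    w_front: float  # trainable front-end weight (hypothetical stand-in)
    w_asr: float    # frozen "ASR" weight, never updated

    def forward(self, x: float) -> float:
        target_feature = self.w_front * x   # front end extracts a feature
        return self.w_asr * target_feature  # frozen discriminator predicts


def train_step(system: ToySystem, x: float, y: float, lr: float = 0.05) -> float:
    """One fine-tuning step; returns the loss measured before the update."""
    err = system.forward(x) - y
    loss = err ** 2
    # Chain rule: d(loss)/d(w_front) = 2 * err * w_asr * x.
    # The gradient passes through w_asr, but w_asr itself is not changed.
    grad_front = 2.0 * err * system.w_asr * x
    system.w_front -= lr * grad_front
    return loss


system = ToySystem(w_front=0.1, w_asr=2.0)
losses = [train_step(system, x=1.0, y=2.0) for _ in range(50)]
```

In a framework such as PyTorch the same constraint is usually expressed by disabling gradients on the ASR parameters (for example with `requires_grad_(False)`) and handing only the front-end parameters to the optimizer.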
[0097] The model training system is planned to achieve low computational cost and high recognition rate. It is divided into two main training phases: the first phase, target feature extraction pre-training, and the second phase, semantic alignment and collaborative fine-tuning. The training tasks, loss functions, and parameter update ranges for each phase are clearly defined, generating configuration information for the two-phase training system. This two-phase training system ensures that the first phase lays the foundation for the second phase, enabling precise semantic optimization in the second phase.
[0098] The first stage involves pre-training for target feature extraction: the training task is to enable the front-end module to acquire basic target feature extraction capabilities and achieve physical feature alignment; the loss function is L1 loss (calculating the difference between the front-end output features and the clean target features); the parameter update range is all parameters of the front-end target feature extraction module (multi-layer convolutional coding layer, AdaLayer modulation layer, output projection layer, etc.); the final output is a front-end module with basic target extraction capabilities.
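As a small worked example of the first-stage objective, the sketch below computes an L1 loss between front-end output features and clean target features, both given as frames × Mel-bins nested lists. Averaging over all elements is an assumption; the text does not state the reduction.

```python
def l1_loss(predicted, target):
    """Mean absolute difference between two T x C feature matrices."""
    if len(predicted) != len(target) or any(
        len(p) != len(t) for p, t in zip(predicted, target)
    ):
        raise ValueError("predicted and target must have the same shape")
    total = sum(
        abs(p - t)
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    )
    count = sum(len(row) for row in predicted)
    return total / count


# Two frames with two Mel bins each (toy values).
front_end_output = [[1.0, 2.0], [3.0, 4.0]]
clean_target = [[1.0, 1.0], [5.0, 4.0]]
stage_one_loss = l1_loss(front_end_output, clean_target)  # (0 + 1 + 2 + 0) / 4
```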
[0099] The second stage involves fine-tuning for semantic alignment: the training task is to align the features extracted by the front-end module with ASR semantic preferences to address the inconsistency between task objectives; the loss function is adapted to the ASR architecture (CTCLoss for CTC architecture, cross-entropy loss for Encoder-Decoder architecture); the parameter update range is limited to the core parameters of the front-end target feature extraction module (such as the learnable γ and β parameters of AdaLayer, and the convolutional kernel parameters of the convolutional coding layer); the final output is the front-end extraction module that is semantically aligned with ASR.
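The two stages described above can be written down as a small configuration sketch. The field names, the parameter-group labels, and the function `two_stage_config` are hypothetical; only the loss choices and the update ranges come from the text.

```python
from dataclasses import dataclass

# Loss adaptation rule stated in the text: CTCLoss for CTC,
# cross-entropy for Encoder-Decoder.
SEMANTIC_LOSS_BY_ARCH = {
    "ctc": "CTCLoss",
    "encoder-decoder": "cross-entropy",
}


@dataclass(frozen=True)
class StageConfig:
    name: str
    task: str
    loss: str
    trainable: tuple  # front-end parameter groups that receive updates
    asr_role: str     # how the pre-trained ASR takes part in this stage


def two_stage_config(asr_architecture: str):
    """Return (stage 1, stage 2) settings for a given ASR architecture."""
    key = asr_architecture.lower()
    if key not in SEMANTIC_LOSS_BY_ARCH:
        raise ValueError(f"unsupported ASR architecture: {asr_architecture!r}")
    stage_one = StageConfig(
        name="target feature extraction pre-training",
        task="physical feature alignment",
        loss="L1",
        trainable=("conv_encoder", "adalayer", "output_projection"),
        asr_role="not used",
    )
    stage_two = StageConfig(
        name="semantic alignment collaborative fine-tuning",
        task="align extracted features with ASR semantic preferences",
        loss=SEMANTIC_LOSS_BY_ARCH[key],
        trainable=("adalayer_gamma_beta", "conv_encoder_kernels"),
        asr_role="frozen static semantic discriminator",
    )
    return stage_one, stage_two


ctc_stages = two_stage_config("CTC")
encdec_stages = two_stage_config("Encoder-Decoder")
```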
[0100] For specific interview scenarios, the two-stage training configuration is as follows: In the first stage, the front-end module is trained using L1 loss, enabling it to extract basic features of the interviewee from the mixed audio of "interviewee + reporter + ambient noise"; In the second stage, the front-end module is cascaded with Whisper (Encoder-Decoder architecture), the Whisper parameters are frozen, and cross-entropy loss is used to fine-tune only the AdaLayer parameters of the front-end, so that the extracted features fit Whisper's semantic preferences, thereby improving the interviewee speech recognition rate.
[0101] Based on the architectural differences of mainstream ASRs, targeted loss function adaptation rules were established. CTCLoss was configured for the CTC architecture, and cross-entropy loss was configured for the Encoder-Decoder architecture. Gradient updates were limited to the front-end target feature extraction module, generating a parameter optimization scheme specific to the front-end module. By establishing targeted loss function adaptation rules based on the architectural differences of mainstream ASRs, it was ensured that the semantic loss calculation accurately reflected the differences between the extracted features and the semantic requirements of ASRs, providing precise guidance for front-end module parameter optimization.
[0102] CTC architecture ASR (such as Wav2Vec2.0): configure CTCLoss. The core formula is $L_{CTC} = -\log P(Y \mid X)$, where X is the front-end output feature, Y is the target real text, and P(Y|X) is the probability of the ASR predicted text.
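To make P(Y|X) concrete, the following sketch computes it with the standard CTC forward algorithm and takes the negative logarithm. It starts from per-frame symbol posteriors, which in the described system would be produced by the frozen ASR from the front-end features X; the function names and the toy posteriors are illustrative assumptions.

```python
import math


def ctc_probability(posteriors, target, blank=0):
    """Sum of the probabilities of all frame alignments that collapse to target.

    posteriors[t][k] is the probability of symbol k at frame t;
    target is a list of non-blank symbol ids.
    """
    # Extended label sequence: blank, y1, blank, y2, ..., blank.
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = posteriors[0][ext[0]]
    if size > 1:
        alpha[1] = posteriors[0][ext[1]]

    for t in range(1, len(posteriors)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * posteriors[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(posteriors, target, blank=0):
    """Negative log-likelihood of target; raises ValueError if P(Y|X) is 0."""
    return -math.log(ctc_probability(posteriors, target, blank))


# Two frames, vocabulary {0: blank, 1: "a"}, uniform posteriors.
uniform = [[0.5, 0.5], [0.5, 0.5]]
p_a = ctc_probability(uniform, [1])  # alignments "a-", "-a", "aa"
```

Production code would run the recursion in log space to avoid underflow on long utterances.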
[0103] Encoder-Decoder architecture ASR (such as Whisper): configure cross-entropy loss. The core formula, in its standard form, is $L_{CE} = -\sum_{i=1}^{N} \log P(y_i)$, where N is the length of the text sequence, $y_i$ is the real token at position i, and $P(y_i)$ is the predicted probability of that token. All loss gradients are applied only within the front-end target feature extraction module and never update the frozen ASR module, ensuring that the ASR parameters are not modified.
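A short worked example of token-level cross-entropy for an Encoder-Decoder ASR follows. It assumes the decoder has already produced a probability distribution over the vocabulary at each of the N positions; summing over positions (rather than averaging) is an assumption, and the function name is illustrative.

```python
import math


def token_cross_entropy(token_probs, target_ids):
    """Sum over positions of -log(probability assigned to the real token).

    token_probs[i][k] is the predicted probability of token k at position i;
    target_ids[i] is the id of the real token at position i.
    """
    if len(token_probs) != len(target_ids):
        raise ValueError("one probability row is required per target token")
    return -sum(math.log(token_probs[i][y]) for i, y in enumerate(target_ids))


# Two positions, vocabulary of two tokens (toy values).
probs = [[0.5, 0.5], [0.25, 0.75]]
real_tokens = [0, 1]
loss = token_cross_entropy(probs, real_tokens)  # -(ln 0.5 + ln 0.75)
```

Dividing the result by N gives the per-token average that many toolkits report by default.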
[0104] When adapting to Wav2Vec2.0 (CTC architecture), CTCLoss is used to calculate the loss between the output features of the front-end module and the actual speech text of the interviewee, and the gradient is only propagated back to update the parameters of the convolutional coding layer of the front end; when adapting to Whisper (Encoder-Decoder architecture), cross-entropy loss is used to calculate the loss, and the gradient is only propagated back to update the parameters of the AdaLayer of the front end, thus achieving dedicated parameter optimization for ASR of different architectures.
[0105] A pluggable adaptation trigger mechanism is set up for the extraction module. After training, the extraction module acts as an independent front-end adapter, which can directly connect to any pre-trained ASR system with matching input dimensions without modifying the ASR model structure and parameters, generating cross-architecture, unmodified access rules. Combining the pluggable architecture characteristics of direct connection to feature domains, cross-architecture, unmodified access rules are set up. The core requirement is "zero access cost and no adaptation difference," ensuring that the trained front-end module can seamlessly connect to any pre-trained ASR system with matching input dimensions.
[0106] Regarding the dimension matching rule, the dimension and time axis resolution of the front-end module's output features must strictly match the ASR input requirements (e.g., both being 80-dimensional Mel-frequency features), requiring no additional dimension conversion. Regarding the interface standardization rule, the front-end module's output interface adopts an industry-standard feature tensor format, with no proprietary interface dependencies, allowing direct interface integration with the ASR input. Regarding the independent operation rule, the front-end module possesses complete target feature extraction capabilities, enabling independent operation without relying on any ASR auxiliary modules (such as feature preprocessing or decoding modules). Regarding the adaptation triggering rule, after training, the front-end module functions as an independent front-end adapter, requiring only physical cascading when connecting to ASR, without any parameter adjustments or structural modifications.
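The dimension matching rule can be sketched as a simple compatibility check between the front-end output and the ASR input. The `FeatureSpec` fields, the 10 ms frame shift, and the layout label are hypothetical values chosen for illustration; only the 80-dimensional Mel example comes from the text.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FeatureSpec:
    n_mels: int            # feature dimension, e.g. 80 Mel bins
    frame_shift_ms: float  # time-axis resolution (assumed field)
    layout: str            # tensor layout label (assumed field)


def mismatches(front_end: FeatureSpec, asr: FeatureSpec):
    """List every field on which the front-end output and ASR input differ."""
    return [
        f"{f.name}: front end {getattr(front_end, f.name)!r} "
        f"!= ASR {getattr(asr, f.name)!r}"
        for f in fields(FeatureSpec)
        if getattr(front_end, f.name) != getattr(asr, f.name)
    ]


def can_plug_in(front_end: FeatureSpec, asr: FeatureSpec) -> bool:
    """True when the front end can be cascaded without any conversion module."""
    return not mismatches(front_end, asr)


front_end_out = FeatureSpec(n_mels=80, frame_shift_ms=10.0, layout="time_major")
asr_80 = FeatureSpec(n_mels=80, frame_shift_ms=10.0, layout="time_major")
asr_128 = FeatureSpec(n_mels=128, frame_shift_ms=10.0, layout="time_major")
```

Calling `mismatches(front_end_out, asr_128)` reports the offending field, which is more useful in deployment logs than a bare boolean.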
[0107] The trained front-end module outputs 80-dimensional Mel-frequency features, with an interface in a standardized tensor format. When connecting to a conferencing system's Wav2Vec2.0 (CTC architecture) or Whisper (Encoder-Decoder architecture), the module's output can be directly connected to the input of either ASR. No modification to the ASR's structure or parameters, or the addition of intermediate conversion modules, is required to achieve cross-architecture, unmodified access and endow both ASRs with target speech recognition capabilities.
[0108] By integrating a standardized collaborative fine-tuning framework, two-stage training system configuration information, front-end module-specific parameter optimization schemes, and cross-architecture, unmodified access rules, a target speech semantic recognition and accurate extraction information system with low computational overhead and high recognition rate is generated. This system includes semantic discrimination benchmarks, full-process training specifications, and multi-architecture adaptation solutions. The integration of these elements forms a complete target speech semantic recognition and accurate extraction information system, ensuring that the information covers the three core aspects of "discrimination benchmarks, training specifications, and adaptation solutions," creating a logical closed loop that can be directly implemented.
[0109] The integrated information core consists of the following: clarifying the selection criteria, parameter freezing requirements, and semantic loss calculation methods for static semantic discriminators, providing a unified benchmark for semantic alignment; clarifying the task objectives, loss functions, parameter update ranges, and training iterations for two-stage training, providing operational guidelines for the training process; and clarifying the access rules, loss function adaptation methods, and dimension matching standards for front-end modules and ASRs of different architectures, providing an execution basis for cross-architecture adaptation.
[0110] The integrated information is as follows: a pre-trained ASR is selected as the static semantic discriminator and its parameters are frozen; in the first stage, all front-end parameters are trained using L1 loss; in the second stage, the corresponding loss function is adapted to the ASR architecture, and only the core front-end parameters are fine-tuned; the front-end module outputs 80-dimensional Mel-frequency features, using a standardized interface that can be integrated into the CTC / Encoder-Decoder architecture ASR without modification. Based on this information, training and deployment can be carried out directly, achieving target speech semantic recognition with low computational overhead and high recognition rate.
[0111] As shown in Figure 2, a pluggable target speaker speech recognition system includes:
[0112] The requirement architecture adaptation module 201 is used to obtain the core application requirements of target speech recognition in multi-speaker scenarios, and combined with the feature domain direct connection design principle, determine the feature domain direct connection pluggable two-stage training architecture that adapts to the requirements.
[0113] The feature preprocessing and voiceprint extraction module 202 is used to receive mixed audio, target reference audio and clean target audio, convert them into the Mel spectrum feature format specified by the downstream ASR, and extract fixed-dimensional global timbre embedding vector through an adaptive voiceprint network of multi-layer convolution + RMSnorm + MeanPooling to generate feature preprocessing and voiceprint embedding information adapted to ASR.
[0114] The target feature extraction training module 203 is used to build a multi-layer adaptive convolutional coding extraction network. The timbre vector is copied along the time axis and then modulated and mixed with audio features by AdaLayer layer by layer to suppress non-target components. It is then restored to the original Mel spectrum dimension by the output projection layer. Combined with L1 loss, physical alignment training is completed to generate feature modulation fusion and restoration information with basic target extraction capabilities.
[0115] The semantic alignment fine-tuning module 204 is used to directly connect the target feature extraction module to the pre-trained ASR feature domain, freeze all ASR parameters so that the gradient only flows at the front end, input mixed and reference audio to obtain target features, select CTCLoss for CTC and cross-entropy loss for Encoder-Decoder according to the ASR architecture, and generate semantic loss feedback and parameter optimization information that only updates the front end.
[0116] The pluggable adapter building module 205 is used to accurately align the output dimension of the target feature extraction module with the input dimension of ASR, and generate feature dimension matching and lightweight fine-tuning pluggable adaptation information that can seamlessly adapt to mainstream ASR.
[0117] The target speech recognition output module 206 is used to rely on the task consistency collaborative fine-tuning mechanism of the frozen ASR, using the ASR as a static semantic discriminator, integrating the information of the above modules, and generating target speech semantic recognition and accurate extraction information with low computing power overhead and high recognition rate.
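The core operations of modules 202 and 203 can be sketched as follows. This is a heavily simplified illustration: the convolution and projection layers are omitted, RMSnorm is applied per frame without a learnable gain, and the AdaLayer is written as a FiLM-style scale and shift in which learnable per-channel γ and β weight the time-tiled timbre vector. That modulation form, and the requirement that the embedding size equal the channel count, are assumptions; the text does not give the exact AdaLayer equation.

```python
import math


def rms_norm(frame, eps=1e-8):
    """Divide a feature vector by its root mean square."""
    rms = math.sqrt(sum(v * v for v in frame) / len(frame) + eps)
    return [v / rms for v in frame]


def timbre_embedding(ref_frames):
    """RMSnorm each reference frame, then mean-pool over the time axis.

    ref_frames is a T x C nested list; the result always has C values,
    whatever the reference audio length T.
    """
    normed = [rms_norm(frame) for frame in ref_frames]
    n_frames = len(normed)
    n_channels = len(normed[0])
    return [sum(f[c] for f in normed) / n_frames for c in range(n_channels)]


def ada_layer(mixed_frames, embedding, gamma, beta):
    """Assumed modulation: out = x * (gamma * e) + beta * e, per channel."""
    # Copy the global timbre vector along the time axis, one copy per frame.
    tiled = [embedding for _ in mixed_frames]
    return [
        [x * gamma[c] * e[c] + beta[c] * e[c] for c, x in enumerate(frame)]
        for frame, e in zip(mixed_frames, tiled)
    ]


emb_short = timbre_embedding([[3.0, 4.0]])
emb_long = timbre_embedding([[3.0, 4.0]] * 5)
modulated = ada_layer(
    mixed_frames=[[1.0, 2.0], [3.0, 4.0]],
    embedding=[0.5, 2.0],
    gamma=[1.0, 1.0],
    beta=[0.0, 1.0],
)
```

Mean pooling is what makes the embedding length independent of the reference audio duration, which is why `emb_short` and `emb_long` have the same size.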
[0118] A computing device includes a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein when the computer program instructions are executed by the processor, the device is triggered to execute any pluggable target speaker speech recognition method.
[0119] The methods and / or embodiments in this application can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowchart. When the computer program is executed by a processing unit, it performs the functions defined in the methods of this application.
[0120] It should be noted that the computer-readable medium described in this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In this application, a computer-readable medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. It will be apparent to those skilled in the art that this application is not limited to the details of the exemplary embodiments described above, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application.
Claims
1. A pluggable target speaker speech recognition method, characterized in that it comprises:
determining a feature-domain direct-connection pluggable two-stage training architecture that meets the target speech recognition requirements in multi-speaker scenarios, the architecture being a cascaded structure of an independent front-end module and a frozen back-end ASR module, consisting of two main parts: a front-end target feature extraction module and a back-end pre-trained ASR module;
planning the model training system in combination with the design goals of low computing power and high recognition rate, dividing it into two training stages, namely first-stage target feature extraction pre-training and second-stage semantic alignment collaborative fine-tuning, clarifying the training tasks, loss functions and parameter update ranges of each stage, and generating configuration information of the two-stage training system;
wherein the first stage is the pre-training of the target feature extraction module, which includes: performing feature preprocessing on mixed audio, target reference audio and clean target audio, uniformly converting them into the Mel spectral feature format specified by the downstream ASR, and extracting a fixed-dimensional global timbre embedding vector through an adaptive voiceprint network of multi-layer convolution + RMSnorm + MeanPooling, generating Mel spectral normalization features and voiceprint embedding information of the three types of audio adapted to ASR; and constructing a multi-layer adaptive convolutional coding extraction network, in which the global timbre embedding vector is copied along the time axis and then used by AdaLayer to modulate the mixed audio features layer by layer and suppress non-target components, the output projection layer restores the result to the original Mel spectrum dimension, and physical alignment training is completed in combination with L1 loss, generating feature modulation fusion and restoration information with basic target extraction capabilities;
wherein the second stage is the task consistency fine-tuning with the frozen ASR, which includes: directly connecting the target feature extraction module to the pre-trained ASR in the feature domain, freezing all ASR parameters so that gradient updates occur only at the front end, inputting mixed and reference audio to obtain target features, selecting CTCLoss for CTC and cross-entropy loss for Encoder-Decoder according to the ASR architecture, and generating semantic loss feedback and parameter optimization information that only updates the front end;
aligning the output dimension of the target feature extraction module with the input dimension of ASR to generate feature dimension matching and lightweight, pluggable adaptation information that can be adapted to mainstream ASR; and
based on the task consistency collaborative fine-tuning mechanism of the frozen ASR, using ASR as a static semantic discriminator to generate target speech semantic recognition and accurate extraction information.
2. The pluggable target speaker speech recognition method according to claim 1, characterized in that performing feature preprocessing on the mixed audio, target reference audio, and clean target audio, uniformly converting them to the Mel spectral feature format specified by the downstream ASR, and extracting a fixed-dimensional global timbre embedding vector through the adaptive voiceprint network of multi-layer convolution + RMSnorm + MeanPooling to generate Mel spectral normalization features and voiceprint embedding information of the three types of audio adapted to ASR, includes:
in combination with the feature-domain direct-connection adaptation requirements of target speech recognition in multi-speaker scenarios, introducing Mel spectrum feature transformation rules and an adaptive voiceprint feature extraction mechanism to determine standardized feature preprocessing schemes for the mixed audio, target reference audio and clean target audio;
performing feature preprocessing operations on the mixed audio, target reference audio, and clean target audio, and converting the three types of audio into the Mel spectral feature format specified by the downstream ASR model;
constructing an adaptive voiceprint feature extraction network consisting of a multi-layer convolutional encoder, RMSnorm root mean square normalization, MeanPooling time-axis average aggregation, and a linear projection layer, with a fixed hidden layer mapping dimension set;
inputting the standardized target reference audio Mel spectral features into the adaptive voiceprint feature extraction network, where, after multi-layer convolutional coding, normalization and aggregation processing, they are mapped to the preset hidden layer dimension through the linear projection layer;
extracting and generating a fixed-dimensional global timbre embedding vector from the feature processing results of the target reference audio; and
integrating the Mel spectrum normalization features of the multiple audio types with the target global timbre embedding vector to generate Mel spectrum normalization features and voiceprint embedding information of the three types of audio that meet the input requirements of downstream ASR models.
3. The pluggable target speaker speech recognition method according to claim 1, characterized in that constructing the multi-layer adaptive convolutional coding extraction network, in which the global timbre embedding vector is copied along the time axis and then used by AdaLayer to modulate the mixed audio features layer by layer and suppress non-target components, the output projection layer restores the original Mel spectrum dimension, and physical alignment training is completed in combination with L1 loss to generate feature modulation fusion and restoration information with basic target extraction capabilities, includes:
combining the feature-domain direct-connection adaptation requirements of target speech recognition in multi-speaker scenarios with the two-stage training logic, introducing a back-end ASR parameter freezing mechanism and a structured loss adaptation rule, and building a collaborative fine-tuning system between the target feature extraction module and the pre-trained ASR;
according to the cascading specification of feature-domain direct connection, directly connecting the output of the target feature extraction module trained in the first stage to the input of the pre-trained ASR model, completing the end-to-end cascading adaptation between the target feature extraction module and the ASR model;
based on the core requirement of task consistency collaborative fine-tuning, freezing all parameters of the pre-trained ASR model and limiting gradient updates to the front-end target feature extraction module, completing the parameter constraint setting of the fine-tuning system;
inputting the mixed audio and target reference audio into the cascaded system, where the front-end target feature extraction module outputs the target speaker features, which are fed into the frozen ASR model to generate text prediction results;
for different ASR architectures such as CTC and Encoder-Decoder, selecting CTCLoss and cross-entropy loss respectively to calculate the semantic loss; and
backpropagating the semantic loss gradient to update only the model parameters of the front-end target feature extraction module, generating semantic loss feedback and parameter optimization information that fine-tunes only the front end, thereby achieving semantic-level alignment between the extracted features and the ASR model.
4. The pluggable target speaker speech recognition method according to claim 1, characterized in that directly cascading the target feature extraction module with the pre-trained ASR in the feature domain, freezing all ASR parameters so that gradient updates occur only at the front end, inputting mixed and reference audio to obtain target features, selecting CTCLoss for CTC and cross-entropy loss for Encoder-Decoder according to the ASR architecture, and generating semantic loss feedback and parameter optimization information that only updates the front end, includes:
combining the pluggable adaptation requirements of target speech recognition in multi-speaker scenarios with the feature-domain direct-connection architecture logic, introducing strict dimensional alignment rules and a lightweight fine-tuning mechanism, and carrying out the adaptive optimization design of the target feature extraction module;
based on the input dimension requirements of mainstream downstream ASR models, precisely aligning the output dimension of the front-end target feature extraction module so that it exactly matches the ASR input dimension;
based on the design principle of feature-domain direct connection, retaining the independent plug-in attribute of the target feature extraction module so that it does not depend on the internal structure of a specific ASR model, completing the construction of the core features of the pluggable architecture;
in combination with the training logic of fine-tuning only the front end, keeping the parameter scale of the target feature extraction module at a lightweight level, ensuring that only a small amount of computing power is required when adapting to different ASRs, completing the computing power optimization of the adaptation process; and
integrating the output dimension matching, architecture decoupling and independence, and fine-tuning computing power characteristics to generate feature dimension matching and lightweight, pluggable adaptation information that enables the target feature extraction module to adapt to various mainstream ASR models.
5. The pluggable target speaker speech recognition method according to claim 4, characterized in that aligning the output dimension of the target feature extraction module with the input dimension of ASR to generate feature dimension matching and lightweight, pluggable adaptation information that is compatible with mainstream ASR, includes:
based on the task consistency collaborative fine-tuning mechanism of two-stage training and the requirements of multi-speaker scene recognition, using the pre-trained ASR model with frozen parameters as a static semantic discriminator to generate semantic evaluation benchmark information adapted to the target feature extraction module;
based on the design goals of low computational overhead and high recognition rate, semantically optimizing the output features of collaborative fine-tuning, clarifying the alignment requirements between the extracted features and the semantic preferences of the ASR model, and generating semantic adaptation criteria information for the target speech features;
relying on the pluggable architecture of feature-domain direct connection, setting access rules under which the target feature extraction module adapts to various mainstream ASRs without modification, ensuring seamless integration with ASR systems of the same input dimension during the inference phase without additional adaptation costs; and
integrating the semantic evaluation benchmark information, the semantic adaptation criteria information, and the unmodified access rules to generate target speech semantic recognition and accurate extraction information that covers the discrimination benchmark, semantic alignment, and convenient access.
6. The pluggable target speaker speech recognition method according to claim 1, characterized in that, based on the task consistency collaborative fine-tuning mechanism of the frozen ASR, using ASR as a static semantic discriminator to generate target speech semantic recognition and accurate extraction information, includes:
based on the core requirements of target speech recognition in multi-speaker scenarios, extracting the core technical elements of freezing ASR parameters, semantic loss reverse guidance, direct connection of feature domains, and lightweight fine-tuning of the front end only;
verifying the extracted core technical elements in a scenario-based manner to confirm the feasibility and logical closed loop of each element in overlapping speech processing and multi-architecture ASR adaptation, and generating technical element verification parameters;
based on the task consistency collaborative fine-tuning mechanism of the frozen ASR, integrating and encapsulating the verified technical elements, solidifying the pre-trained ASR model into a static semantic discriminator, and generating a standardized collaborative fine-tuning framework;
planning the model training system in combination with the design goals of low computing power and high recognition rate, dividing it into two training stages, namely first-stage target feature extraction pre-training and second-stage semantic alignment collaborative fine-tuning, clarifying the training tasks, loss functions and parameter update ranges of each stage, and generating configuration information of the two-stage training system;
based on the architectural differences of mainstream ASR, setting targeted loss function adaptation rules, configuring CTCLoss for the CTC architecture and cross-entropy loss for the Encoder-Decoder architecture, and limiting backpropagated gradient updates to the front-end target feature extraction module, generating a parameter optimization scheme exclusive to the front-end module;
setting a pluggable adaptation triggering mechanism for the target feature extraction module, such that after training the target feature extraction module acts as an independent front-end adapter that can be directly connected to any pre-trained ASR system with a matching input dimension without modifying the ASR model structure and parameters, and generating cross-architecture, unmodified access rules; and
integrating the standardized collaborative fine-tuning framework, the two-stage training system configuration information, the front-end module-specific parameter optimization scheme, and the cross-architecture, unmodified access rules to generate target speech semantic recognition and accurate extraction information, including semantic discrimination benchmarks, full-process training specifications, and multi-architecture adaptation schemes.
7. A pluggable target speaker speech recognition system, characterized in that the system is used to execute executable instructions to perform the pluggable target speaker speech recognition method according to any one of claims 1 to 6.
8. An electronic device, characterized in that it comprises: a first processor; and a memory for storing executable instructions of the first processor; wherein the first processor is configured to execute the pluggable target speaker speech recognition method according to any one of claims 1 to 6 by executing the executable instructions.
9. A computing device, the device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to perform the pluggable target speaker speech recognition method according to any one of claims 1 to 6.