A dialect language recognition method based on multi-modal fusion
By using a multimodal fusion dialect language recognition method, multimodal data is collected and preprocessed, and a feature encoder and attention mechanism model are constructed. This solves the problem of insufficient compatibility of dialect recognition features in existing technologies, and achieves efficient dialect recognition and improved robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 海南经贸职业技术学院
- Filing Date
- 2026-02-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing dialect recognition methods are based on single-modal features or traditional deep learning frameworks, resulting in insufficient feature compatibility and difficulty in learning the relationships between multiple dialects. Furthermore, multimodal models often sacrifice speech recognition performance when optimizing dialect recognition performance, failing to meet the needs of multi-task collaborative optimization.
By collecting and preprocessing multimodal data of dialects and Mandarin, audio and text features are extracted and integrated, a dialect language recognition model based on feature encoder and attention mechanism is constructed, and the model is trained using feature association dataset to capture local features and global dependencies of speech signals and generate text sequences.
It achieves efficient fusion of multimodal features, improves the accuracy and robustness of dialect recognition, solves the problem of rapid recognition of multiple dialects, and enhances the model's recognition accuracy and scene adaptability for the target dialect.
Figure CN122245288A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of dialect language recognition technology, and in particular to a dialect language recognition method based on multimodal fusion.
Background Technology
[0002] The Chinese dialect system is complex, with numerous variants that differ significantly in pronunciation, vocabulary, and grammatical structure. At the same time, they share similar phonological rules due to a common origin, posing a significant challenge to dialect recognition. With the increasing demands in areas such as smart government, remote services, and cultural heritage protection, the accuracy, robustness, and scenario adaptability of dialect recognition have become core requirements.
[0003] Most existing dialect recognition methods are based on single-modal features or traditional deep learning frameworks, resulting in insufficient feature compatibility. They fail to process the acoustic and textual features of dialect data and Mandarin data in a compatible manner, so the two differ in dimensionality, distribution, and semantic representation. This makes it difficult for the model to learn the relationship between the two, resulting in low dialect recognition accuracy. Their learning ability is also poor: when trained on multiple dialects, the model does not learn how to learn a dialect, and so cannot rapidly recognize a target dialect. Moreover, some existing multimodal models sacrifice speech recognition (ASR) performance when optimizing dialect recognition performance, a trade-off that fails to meet the needs of multi-task collaborative optimization in practical applications.
Summary of the Invention
[0004] Therefore, the purpose of this invention is to provide a dialect language recognition method based on multimodal fusion to solve or at least partially solve the above-mentioned problems existing in the prior art.
[0005] To achieve the above objectives, this invention provides a dialect language recognition method based on multimodal fusion, the method comprising the following steps: S101. Collect the first multimodal data of dialect language and the data of Mandarin, and perform preprocessing operations on them; S102. Perform feature extraction on the first multimodal data and Mandarin data after preprocessing to obtain dialect language features and Mandarin features; S103. Perform feature compatibility processing on dialect language features and Mandarin features to obtain a feature association dataset; S104. Construct a dialect language recognition model and train the dialect language recognition model using a feature association dataset; S105. Collect the second multimodal data of the dialect language to be identified, extract multimodal features from the second multimodal data, input the extracted multimodal features into the trained dialect language recognition model, and output the corresponding dialect recognition result.
[0006] Furthermore, the preprocessing operations include data cleaning and data augmentation. The data cleaning is used to remove invalid data, unify data format, remove redundant characters and non-linguistic characters, and the data augmentation is used for speech rate adjustment, audio compression and extension, and noise addition.
[0007] Furthermore, step S102 specifically includes the following steps: S21. Perform a frame-segmentation operation on the speech signals of the first multimodal data and the Mandarin data to obtain dialect short-time signal segments and Mandarin short-time signal segments; S22. Extract the Mel-Cepstral, Gammatone cepstral, and frame energy features of each dialect short-time signal segment and each Mandarin short-time signal segment; S23. Concatenate the Mel-Cepstral and Gammatone cepstral features of each dialect short-time signal segment and each Mandarin short-time signal segment along the feature column dimension, then extend the frame energy features to the same dimension as the Mel-Cepstral and Gammatone cepstral features and concatenate them along the feature column dimension to form the final dialect language features and Mandarin features.
[0008] Furthermore, step S103 specifically includes the following steps: S31. Perform feature dimension verification on dialect language features and Mandarin features, and unify the dimensions and data format of dialect language features and Mandarin features; S32. Perform a distribution alignment operation on dialect language features and Mandarin features to make the distribution of dialect language features consistent with the distribution of Mandarin features; S33. Perform semantic mapping operations on dialect language features and Mandarin features to establish the association between dialect language features and Mandarin features; S34. The dialect language features after alignment and mapping operations are fused and encapsulated with the Mandarin features to generate a feature association dataset.
[0009] Furthermore, the dialect language recognition model is constructed from a feature encoder, an attention mechanism decoder, and an auxiliary decoding branch. The dialect language recognition model is based on the feature encoder, which uses a convolutional neural network and a multi-head self-attention mechanism to capture the local features and global dependencies of speech signals in the feature association dataset. The attention mechanism decoder generates text sequences based on the feature association dataset through the attention mechanism, and an auxiliary decoding branch is introduced to perform alignment supervision on the attention mechanism decoder.
[0010] Furthermore, the training of the dialect language recognition model using the feature association dataset specifically includes the following steps: S41. Select the dialect language features of the dialect to be identified and their corresponding Mandarin features from the feature association dataset, and use them as the total target task set. Use the remaining dialect language features and their corresponding Mandarin features as the total dialect task set. S42. Randomly select different dialect language features and their corresponding Mandarin features from the total dialect task set as sub-dialect task sets; S43. Randomly select a dialect language feature and its corresponding Mandarin feature from the overall target task set as a sub-target task set, and divide the sub-target task set into a sub-target training set and a dialect verification set according to a preset ratio; S44. Combine the sub-dialect task set and the sub-target training set into a dialect training set, and use the dialect training set to optimize the initial parameters of the dialect language recognition model.
[0011] Furthermore, the optimization of dialect language recognition model parameters using a dialect training set specifically includes the following steps: S51. Based on the initial parameters of the dialect language recognition model, define the initial parameter weights, calculate the gradient of the dialect language recognition model using the sub-dialect task set, and update the initial parameter weights to obtain the first parameter weights. [Formula not reproduced in the source text; its terms are the first parameter weights, the sub-dialect task set, and the initial parameter weights.]
[0012] S52. Based on the first parameter weights, optimize the dialect language recognition model using the sub-target training set, updating the first parameter weights to obtain the second parameter weights. [Formula not reproduced in the source text; its terms are the second parameter weights, the sub-target training set, and the number of update steps.]
[0013] S53. Optimize the initial parameters based on the second parameter weights to form a new dialect language recognition model, and use the dialect verification set to verify the dialect recognition effect of the new dialect language recognition model.
[0014] Furthermore, the dialect recognition results can be converted into preset format files, and the dialect recognition results include dialect language category and the corresponding Mandarin data.
[0015] Compared with the prior art, the beneficial effects of the present invention are: This invention proposes a dialect language recognition method based on multimodal fusion. By performing feature compatibility processing on dialect language features and Mandarin features, the model can effectively learn the association patterns between the two types of features. A dialect language recognition model is constructed to capture the local features and global dependencies of speech signals. Text sequences are generated based on a feature association dataset, and the model is trained using this dataset, improving its knowledge learning and understanding. This invention achieves efficient fusion of multimodal features, improving the accuracy and robustness of dialect recognition.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only preferred embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a schematic diagram of a dialect language recognition method based on multimodal fusion provided in an embodiment of the present invention.
Detailed Implementation
[0018] The principles and features of the present invention are described below with reference to the accompanying drawings. The listed embodiments are only used to explain the present invention and are not intended to limit the scope of the present invention.
[0019] Referring to Figure 1, this embodiment provides a dialect language recognition method based on multimodal fusion, the method including the following steps: S101. Collect the first multimodal data of the dialect language and the Mandarin data, and perform preprocessing operations on them. Both the first multimodal data and the Mandarin data include audio data and text data.
[0020] S102. Feature extraction is performed on the preprocessed first multimodal data and Mandarin data to obtain dialect language features and Mandarin language features. The dialect language features and Mandarin language features respectively include audio language features and text language features.
[0021] S103. Perform feature compatibility processing on dialect language features and Mandarin features. This proceeds in two stages: first, feature compatibility processing is performed between the audio language features in the dialect language features and the text language features in the Mandarin features; second, feature compatibility processing is performed between the text language features in the dialect language features and the audio and text language features produced by the first stage, yielding the feature association dataset.
[0022] S104. Construct a dialect language recognition model and train it using a feature association dataset.
[0023] S105. Collect the second multimodal data of the dialect language to be identified, extract multimodal features from the second multimodal data, input the extracted multimodal features into the trained dialect language recognition model, and output the corresponding dialect recognition result.
[0024] By training the dialect language recognition model on the feature association dataset, the model learns how to learn a dialect while training on multiple dialects, acquiring rich knowledge of the features that dialects share. This shared knowledge is then used to acquire knowledge of the target dialect, which improves the model's accuracy on the target dialect and solves the problem of rapid identification of the target dialect.
[0025] In step S101, the first multimodal data of the dialect language is collected through community surveys and recordings, and is recorded as 16 kHz mono WAV audio. The preprocessing operation specifically includes the following. Data cleaning is performed on the audio data in the first multimodal data and the Mandarin data, including: precise alignment of audio and annotation files, using automated tools such as Dynamic Time Warping (DTW) to ensure strict alignment between annotations and audio and removing data that cannot be aligned; removal of invalid or low-quality data, removing audio with a volume below a set threshold (-30 dBFS) and a signal-to-noise ratio below 15 dB; unification of audio format, converting all audio to mono WAV at a 16 kHz sampling rate to ensure consistency; silence segmentation, splitting audio at silences lasting 0.5 seconds or more; and unification of segment length.
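The volume and signal-to-noise thresholds above can be illustrated with a minimal sketch. It assumes samples are floats in [-1.0, 1.0], that "volume" means RMS level in dBFS, that the signal-to-noise ratio has already been estimated elsewhere, and that a clip is dropped if either criterion fails; the function names `rms_dbfs` and `keep_clip` are hypothetical and do not come from the source.

```python
import math

def rms_dbfs(samples):
    """Return the RMS level, in dBFS, of float samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10*log10 of the mean square equals 20*log10 of the RMS amplitude.
    return 10.0 * math.log10(mean_square)

def keep_clip(samples, snr_db, min_dbfs=-30.0, min_snr_db=15.0):
    """Keep a clip only if it is loud enough and clean enough (assumed rule)."""
    return rms_dbfs(samples) >= min_dbfs and snr_db >= min_snr_db
```

A clip with constant amplitude 0.01 sits near -40 dBFS and would be discarded under these assumptions, regardless of its signal-to-noise ratio.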
[0026] Data cleaning is also performed on the text data in the first multimodal data and the Mandarin data, removing redundant characters and non-linguistic characters. Data augmentation is then applied to the audio data, including: speech rate adjustment, expanding the corpus and increasing the diversity of speech samples through dynamic time warping (0.9x and 1.1x speed); audio compression and extension, simulating changes in speech rhythm in dialects as well as real speech rate differences and emotional fluctuations, to enhance sample coverage; noise addition, adding background noise (wind, street sounds, etc.) at different signal-to-noise ratios to simulate real-world scenarios and improve the model's robustness to complex scenes, and introducing convolutional noise perturbation to enhance the robustness of speech recognition; SpecAugment, comprising time masking, which randomly masks a portion of time frames (25% of frames, chosen at random), and frequency masking, which randomly masks a portion of frequency bands (2 to 5 bands); and silence correction, eliminating meaningless silences and optimizing audio clarity.
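The SpecAugment masking described above can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the spectrogram is a list of frames, masked cells are set to zero, frames and bands are masked individually rather than in contiguous blocks, and the name `spec_augment` and its parameters are hypothetical.

```python
import random

def spec_augment(spec, time_ratio=0.25, min_bands=2, max_bands=5, seed=None):
    """Return a masked copy of spec (a list of frames, each a list of band values)."""
    rng = random.Random(seed)
    n_frames = len(spec)
    n_bands = len(spec[0]) if spec else 0
    masked = [list(frame) for frame in spec]

    # Time masking: pick a random 25% of the frames.
    n_time = int(round(time_ratio * n_frames))
    time_idx = set(rng.sample(range(n_frames), n_time))

    # Frequency masking: pick between 2 and 5 bands at random.
    n_freq = rng.randint(min_bands, min(max_bands, n_bands)) if n_bands >= min_bands else 0
    band_idx = set(rng.sample(range(n_bands), n_freq))

    # Zero every cell that lies in a masked frame or a masked band.
    for t in range(n_frames):
        for b in range(n_bands):
            if t in time_idx or b in band_idx:
                masked[t][b] = 0.0
    return masked, time_idx, band_idx
```

The input spectrogram is left untouched, so the same clip can be augmented several times with different seeds.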
[0027] Step S102 specifically includes the following steps: S21. Perform a frame-segmentation operation on the speech signals of the first multimodal data and the Mandarin data to obtain dialect short-time signal segments and Mandarin short-time signal segments. S22. Extract the Mel-Cepstral, Gammatone cepstral, and frame energy features from each dialect short-time signal segment and each Mandarin short-time signal segment. Concatenating the Mel-Cepstral and Gammatone cepstral features effectively characterizes the auditory characteristics of speech. The frame energy features effectively characterize the differences between voiced and unvoiced sounds in dialects. Combining the Mel-Cepstral and Gammatone cepstral features with the frame energy features yields rich linguistic features.
[0028] S23. Concatenate the Mel-Cepstral and Gammatone cepstral features of each dialect short-time signal segment and each Mandarin short-time signal segment along the feature column dimension. Then extend the frame energy features to the same dimension as the Mel-Cepstral and Gammatone cepstral features and concatenate them along the feature column dimension to form the final dialect language features and Mandarin features.
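Steps S21 to S23 can be sketched in a few lines. The sketch assumes that frame energy is the sum of squared samples and that "extending" the energy feature means tiling the scalar to the cepstral width; the source does not spell out either point. The names `split_frames`, `frame_energy`, and `fuse_features` are hypothetical, and the cepstral features are taken as already computed.

```python
def split_frames(signal, frame_len, hop):
    """S21: cut a signal into overlapping short-time frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def frame_energy(frame):
    """S22 (energy part): sum of squared samples of one frame."""
    return sum(s * s for s in frame)

def fuse_features(mel_cep, gamma_cep, energy):
    """S23: concatenate per-frame features along the column dimension."""
    fused = []
    for m, g, e in zip(mel_cep, gamma_cep, energy):
        # Tile the scalar energy to the cepstral width before concatenating.
        fused.append(list(m) + list(g) + [e] * len(m))
    return fused
```

With 13 Mel-Cepstral and 13 Gammatone cepstral coefficients per frame, this reading would give 39 columns per frame; the coefficient counts are illustrative only.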
[0029] The extraction of Mel-Cepstral features of each dialect short-time signal segment and the Mandarin short-time signal segment specifically includes the following steps: S2.1 Perform a Fast Fourier Transform on the speech signals of each dialect short-time signal segment and the Mandarin short-time signal segment to convert them into frequency domain signals.
[0030] S2.2 Filter the frequency domain signal through a set of Mel frequency filters to convert it into a Mel frequency spectrum.
[0031] S2.3. Perform a logarithmic operation on the Mel frequency spectrum to obtain the logarithmic Mel frequency spectrum coefficients, then perform a discrete cosine transform on these coefficients to obtain the Mel-Cepstral features, as shown below:
$$C_i^{\mathrm{mel}} = \mathrm{DCT}\left(\log S_i^{\mathrm{mel}}\right)$$
[0032] where $C_i^{\mathrm{mel}}$ is the $i$-th Mel-Cepstral feature, $\mathrm{DCT}(\cdot)$ is the discrete cosine transform, and $S_i^{\mathrm{mel}}$ is the $i$-th Mel frequency spectrum.
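Step S2.3 can be illustrated with a small standard-library sketch. It assumes a natural logarithm, an unnormalised DCT-II, a small floor to avoid taking the logarithm of zero, and 13 retained coefficients; none of these choices is stated in the source, and the names `dct_ii` and `mel_cepstrum` are hypothetical.

```python
import math

def dct_ii(values):
    """Unnormalised DCT-II of a list of floats."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
            for k in range(n)]

def mel_cepstrum(mel_spectrum, n_coeffs=13, floor=1e-10):
    """Log-compress a Mel spectrum, then keep the first DCT coefficients."""
    log_mel = [math.log(max(e, floor)) for e in mel_spectrum]
    return dct_ii(log_mel)[:n_coeffs]
```

A flat Mel spectrum puts all of its energy into the zeroth coefficient, which is a quick sanity check for any DCT implementation used here.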
[0033] The extraction of Gammatone cepstral features from each dialect short-time signal segment and each Mandarin short-time signal segment is similar to the extraction of Mel-Cepstral features, except that a linear Gammatone frequency-domain filter is used, as shown below:
$$C_i^{\mathrm{gt}} = \mathrm{DCT}\left(\log S_i^{\mathrm{gt}}\right)$$
[0034] where $C_i^{\mathrm{gt}}$ is the $i$-th Gammatone cepstral feature, $\mathrm{DCT}(\cdot)$ is the discrete cosine transform, and $S_i^{\mathrm{gt}}$ is the $i$-th Gammatone spectrum obtained after processing by the linear Gammatone frequency-domain filter.
[0035] Step S103 specifically includes the following steps: S31. Perform feature dimension verification on dialect language features and Mandarin features, and unify the dimensions and data format of dialect language features and Mandarin features; S32. Perform a distribution alignment operation on dialect language features and Mandarin features to make the distribution of dialect language features consistent with the distribution of Mandarin features; S33. Perform semantic mapping operations on dialect language features and Mandarin features to establish the association between dialect language features and Mandarin features; S34. The dialect language features after alignment and mapping operations are fused and encapsulated with the Mandarin features to generate a feature association dataset.
[0036] Step S103 addresses the problem of the inability to train Mandarin and dialect language data collaboratively due to differences in feature dimensions and distribution. It involves performing feature compatibility processing on dialect language features and Mandarin features, enabling the dialect language recognition model to learn the feature association rules between dialect language features and Mandarin features, thereby supporting the training and application of the dialect language recognition model.
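The source does not name a specific technique for the distribution alignment in S32. One simple possibility, shown purely as an assumption, is per-dimension mean and standard deviation matching, where the dialect features are shifted and scaled to the Mandarin statistics. The names `column_stats` and `align_to_reference` are hypothetical, and both inputs are assumed to already share the same dimension (S31).

```python
import math

def column_stats(rows):
    """Per-column mean and (population) standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n)
            for d in range(dims)]
    return means, stds

def align_to_reference(dialect, mandarin, eps=1e-8):
    """Shift and scale dialect features to the Mandarin per-column mean and std."""
    d_mean, d_std = column_stats(dialect)
    m_mean, m_std = column_stats(mandarin)
    return [[(x - d_mean[i]) / (d_std[i] + eps) * m_std[i] + m_mean[i]
             for i, x in enumerate(row)]
            for row in dialect]
```

This only matches first- and second-order statistics; the semantic mapping in S33 would need a learned component that this sketch does not attempt.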
[0037] In step S104, the dialect language recognition model is constructed from a feature encoder, an attention mechanism decoder, and an auxiliary decoding branch (a connectionist temporal classification, CTC, auxiliary decoding branch). Each layer of the feature encoder includes a feedforward module, a multi-head self-attention mechanism, and a convolutional neural network. The attention mechanism decoder is a bidirectional long short-term memory network with a hidden dimension of 512 in each layer. The CTC auxiliary decoding branch is inserted after the 6th layer of the feature encoder to assist training and accelerate convergence, and it employs phoneme-level supervision to improve the learning of low-level features.
[0038] The proposed dialect language recognition model utilizes a feature encoder, employing a convolutional neural network and a multi-head self-attention mechanism to capture local features and global dependencies of speech signals in a feature-association dataset. Relative positional encoding is introduced to enhance the modeling ability for long sentences. An attention-based decoder, based on a bidirectional long short-term memory network, generates text sequences from the feature-association dataset using an attention mechanism to capture the contextual relationships within the text sequences and optimize decoding accuracy. An auxiliary decoding branch is introduced to supervise the alignment of the attention-based decoder. A Conformer encoder extracts speech features, and a LAS-CTC joint decoder forms a complementary approach, addressing temporal alignment issues and improving language accuracy in low-resource scenarios during dialect recognition.
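The source does not state how the loss of the attention mechanism decoder and the loss of the auxiliary CTC branch are combined during training. A common choice in hybrid attention/CTC systems, given here only as an assumption with a hypothetical interpolation weight $\lambda$, is a weighted sum:

```latex
\mathcal{L}_{\mathrm{total}}
  = \lambda \, \mathcal{L}_{\mathrm{CTC}}
  + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
\qquad 0 \le \lambda \le 1
```

Here $\mathcal{L}_{\mathrm{att}}$ denotes the sequence loss of the attention mechanism decoder and $\mathcal{L}_{\mathrm{CTC}}$ the loss of the auxiliary branch; both symbols are introduced for illustration and do not appear in the source.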
[0039] In step S104, training the dialect language recognition model using the feature association dataset specifically includes the following steps: S41. Select the dialect language features of the dialect to be identified and their corresponding Mandarin features from the feature association dataset, and use them as the overall target task set. For example, select low-resource dialects as the dialects to be identified. Then use the remaining dialect language features and their corresponding Mandarin features as the overall dialect task set.
[0040] S42. Randomly select different dialect language features and their corresponding Mandarin features from the overall dialect task set as sub-dialect task sets.
[0041] S43. Randomly select a dialect language feature and its corresponding Mandarin feature from the overall target task set as a sub-target task set, and divide the sub-target task set into a sub-target training set and a dialect verification set according to a preset ratio.
[0042] S44. Combine the sub-dialect task set and the sub-target training set into a dialect training set, and use the dialect training set to optimize the initial parameters of the dialect language recognition model, as shown below:
$$D_{\mathrm{train}} = D_{\mathrm{sub}} \cup D_{\mathrm{tgt}}$$
[0043] where $D_{\mathrm{train}}$ is the dialect training set, $D_{\mathrm{sub}}$ is the sub-dialect task set, and $D_{\mathrm{tgt}}$ is the sub-target training set.
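Steps S41 to S44 can be sketched as a small sampling routine. The sketch assumes the feature association dataset is a mapping from dialect name to a list of (dialect feature, Mandarin feature) pairs, and, as a simplification, uses all pairs of the target dialect as the sub-target task set. The function name `build_task_sets`, the number of sampled dialects, and the 0.8 split ratio are hypothetical.

```python
import random

def build_task_sets(pairs_by_dialect, target, n_sub=2, train_ratio=0.8, seed=None):
    """Return (dialect training set, dialect verification set)."""
    rng = random.Random(seed)
    # S41: separate the target dialect from the remaining dialects.
    total_target = list(pairs_by_dialect[target])
    others = {d: p for d, p in pairs_by_dialect.items() if d != target}
    # S42: sample several of the remaining dialects as the sub-dialect task set.
    chosen = rng.sample(sorted(others), min(n_sub, len(others)))
    sub_dialect = [pair for d in chosen for pair in others[d]]
    # S43: shuffle the target pairs and split them by the preset ratio.
    rng.shuffle(total_target)
    cut = int(len(total_target) * train_ratio)
    sub_target_train, verification = total_target[:cut], total_target[cut:]
    # S44: dialect training set = sub-dialect task set + sub-target training set.
    return sub_dialect + sub_target_train, verification
```

Calling the routine repeatedly with different seeds would yield a fresh sub-dialect task set on each call, which matches the repeated random selection described in S42.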
[0044] In step S44, optimizing the dialect language recognition model parameters using the dialect training set specifically includes the following steps: S51. Based on the initial parameters of the dialect language recognition model, define the initial parameter weights, calculate the gradient of the dialect language recognition model using the sub-dialect task set, and update the initial parameter weights to obtain the first parameter weights.
[0045] [Formula not reproduced in the source text; its terms are the first parameter weights, the sub-dialect task set, and the initial parameter weights.]
[0046] S52. Based on the first parameter weights, optimize the dialect language recognition model using the sub-target training set, updating the first parameter weights to obtain the second parameter weights.
[0047] [Formula not reproduced in the source text; its terms are the second parameter weights, the sub-target training set, and the number of update steps.]
[0048] S53. Optimize the initial parameters based on the second parameter weights to obtain new parameters. Then update the dialect language recognition model with the new parameters to form a new dialect language recognition model. Finally, use the dialect verification set to verify the dialect recognition performance of the new model.
[0049] [Formula not reproduced in the source text; its terms are the new parameters, the initial parameters, and the learning rate of the dialect language recognition model.]
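Because the formulas for S51 to S53 are not available, the following toy example shows only one plausible reading: a first-order scheme in which the initial parameters are moved toward the weights obtained after adapting on the sub-dialect task set and then on the sub-target training set. The scalar parameter, the squared-error loss, and the names `grad`, `meta_step`, `inner_lr`, `steps`, and `meta_lr` are all assumptions made for illustration.

```python
def grad(theta, data):
    """Gradient of the toy loss mean((theta - x)^2) over a dataset."""
    return 2.0 * sum(theta - x for x in data) / len(data)

def meta_step(theta0, sub_dialect, sub_target_train,
              inner_lr=0.1, steps=3, meta_lr=0.5):
    """One round of S51-S53 on a scalar parameter (assumed interpretation)."""
    # S51: one gradient update on the sub-dialect task set.
    theta1 = theta0 - inner_lr * grad(theta0, sub_dialect)
    # S52: a fixed number of further updates on the sub-target training set.
    theta2 = theta1
    for _ in range(steps):
        theta2 = theta2 - inner_lr * grad(theta2, sub_target_train)
    # S53: move the initial parameters toward the adapted weights.
    return theta0 + meta_lr * (theta2 - theta0)
```

Under this reading, the verification step in S53 would evaluate the model at the returned parameters on the dialect verification set before the next round begins.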
[0050] In step S104, the dialect learning ability is obtained from the process of learning the different dialect language features and their corresponding Mandarin features in the sub-dialect task set. The dialect language recognition model parameters are updated according to the dialect learning ability to improve the dialect language recognition model's learning and understanding of target dialect knowledge, thereby improving the dialect language recognition model's recognition efficiency and accuracy of the target dialect.
[0051] In step S105, the dialect recognition result can be converted into a preset format file. The dialect recognition result includes the dialect language category and the corresponding Mandarin data. The preset format file includes JSON, XML, and TXT formats, etc., and users can convert the dialect recognition result into the corresponding preset format file according to their needs.
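The format conversion in step S105 can be sketched with the Python standard library. The field names `dialect_category` and `mandarin_text`, the `result` root element, and the function name `export_result` are hypothetical; the source only states that the result contains the dialect language category and the corresponding Mandarin data.

```python
import json
import xml.etree.ElementTree as ET

def export_result(dialect_category, mandarin_text, fmt="json"):
    """Serialise one recognition result as a JSON, XML, or TXT string."""
    result = {"dialect_category": dialect_category,
              "mandarin_text": mandarin_text}
    if fmt == "json":
        return json.dumps(result, ensure_ascii=False)
    if fmt == "xml":
        root = ET.Element("result")
        for key, value in result.items():
            ET.SubElement(root, key).text = value
        return ET.tostring(root, encoding="unicode")
    if fmt == "txt":
        return f"{dialect_category}\t{mandarin_text}"
    raise ValueError(f"unsupported format: {fmt}")
```

Keeping the result as a plain dictionary until the last step makes it straightforward to add further preset formats later.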
[0052] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A dialect language recognition method based on multi-modal fusion, characterized in that, The method includes the following steps: S101. Collect the first multimodal data of dialect language and the data of Mandarin, and perform preprocessing operations on them; S102. Perform feature extraction on the first multimodal data and Mandarin data after preprocessing to obtain dialect language features and Mandarin features; S103. Perform feature compatibility processing on dialect language features and Mandarin features to obtain a feature association dataset; S104. Construct a dialect language recognition model and train the dialect language recognition model using a feature association dataset; S105. Collect the second multimodal data of the dialect language to be identified, extract multimodal features from the second multimodal data, input the extracted multimodal features into the trained dialect language recognition model, and output the corresponding dialect recognition result.
2. The dialect language recognition method based on multi-modal fusion according to claim 1, characterized in that, The preprocessing operations include data cleaning and data augmentation. Data cleaning is used to remove invalid data, unify data format, remove redundant characters and non-language characters, and data augmentation is used to adjust speech rate, compress and extend audio, and add noise.
3. The dialect language recognition method based on multi-modal fusion according to claim 1, characterized in that step S102 specifically includes the following steps: S21. Perform a frame-segmentation operation on the speech signals of the first multimodal data and the Mandarin data to obtain dialect short-time signal segments and Mandarin short-time signal segments; S22. Extract the Mel-Cepstral, Gammatone cepstral, and frame energy features of each dialect short-time signal segment and each Mandarin short-time signal segment; S23. Concatenate the Mel-Cepstral and Gammatone cepstral features of each dialect short-time signal segment and each Mandarin short-time signal segment along the feature column dimension, then extend the frame energy features to the same dimension as the Mel-Cepstral and Gammatone cepstral features and concatenate them along the feature column dimension to form the final dialect language features and Mandarin features.
4. The dialect language recognition method based on multi-modal fusion according to claim 3, characterized in that, Step S103 specifically includes the following steps: S31. Perform feature dimension verification on dialect language features and Mandarin features, and unify the dimensions and data format of dialect language features and Mandarin features; S32. Perform a distribution alignment operation on dialect language features and Mandarin features to make the distribution of dialect language features consistent with the distribution of Mandarin features; S33. Perform semantic mapping operations on dialect language features and Mandarin features to establish the association between dialect language features and Mandarin features; S34. The dialect language features after alignment and mapping operations are fused and encapsulated with the Mandarin features to generate a feature association dataset.
5. The dialect language recognition method based on multimodal fusion according to claim 1, characterized in that, The dialect language recognition model is constructed from a feature encoder, an attention mechanism decoder, and an auxiliary decoding branch. The dialect language recognition model is based on the feature encoder, which uses a convolutional neural network and a multi-head self-attention mechanism to capture the local features and global dependencies of speech signals in the feature association dataset. The attention mechanism decoder generates text sequences based on the feature association dataset through the attention mechanism, and an auxiliary decoding branch is introduced to perform alignment supervision on the attention mechanism decoder.
6. The dialect language recognition method based on multimodal fusion according to claim 1, characterized in that, The process of training the dialect language recognition model using a feature association dataset specifically includes the following steps: S41. Select the dialect language features of the dialect to be identified and their corresponding Mandarin features from the feature association dataset, and use them as the total target task set. Use the remaining dialect language features and their corresponding Mandarin features as the total dialect task set. S42. Randomly select different dialect language features and their corresponding Mandarin features from the total dialect task set as sub-dialect task sets; S43. Randomly select a dialect language feature and its corresponding Mandarin feature from the overall target task set as a sub-target task set, and divide the sub-target task set into a sub-target training set and a dialect verification set according to a preset ratio; S44. Combine the sub-dialect task set and the sub-target training set into a dialect training set, and use the dialect training set to optimize the initial parameters of the dialect language recognition model.
7. The dialect language recognition method based on multimodal fusion according to claim 6, characterized in that the optimization of dialect language recognition model parameters using a dialect training set specifically includes the following steps: S51. Based on the initial parameters of the dialect language recognition model, define the initial parameter weights, calculate the gradient of the dialect language recognition model using the sub-dialect task set, and update the initial parameter weights to obtain the first parameter weights [formula not reproduced in the source text; its terms are the first parameter weights, the sub-dialect task set, and the initial parameter weights]; S52. Based on the first parameter weights, optimize the dialect language recognition model using the sub-target training set, updating the first parameter weights to obtain the second parameter weights [formula not reproduced in the source text; its terms are the second parameter weights, the sub-target training set, and the number of update steps]; S53. Optimize the initial parameters based on the second parameter weights to form a new dialect language recognition model, and use the dialect verification set to verify the dialect recognition effect of the new dialect language recognition model.
8. The dialect language recognition method based on multimodal fusion according to claim 1, characterized in that, The dialect recognition results can be converted into preset format files, and the dialect recognition results include dialect language category and the corresponding Mandarin data.