Speech recognition system based on MATLAB

By developing a GUI on the MATLAB platform, combining the HMM algorithm with MFCC features, and using the DTW algorithm for speech signal matching, the complexity of isolated speech recognition was reduced and an efficient Chinese digit speech recognition system was realized with a recognition rate of over 90%.

CN122201267A · Pending · Publication Date: 2026-06-12 · NINGBO TENGYUE ZHIDA TECHNOLOGY DEVELOPMENT CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: NINGBO TENGYUE ZHIDA TECHNOLOGY DEVELOPMENT CO LTD
Filing Date: 2024-12-11
Publication Date: 2026-06-12

Application Information

IPC: G10L15/14; G10L15/197; G10L25/24; G10L25/18; G06F18/23213

AI Tagging

Application Domain

Speech recognition

AI Technical Summary

Technical Problem

Existing technology makes the move from non-continuous to continuous speech recognition complex, particularly in the recognition of isolated speech signals. The lack of an intuitive graphical user interface and of efficient algorithm support leads to complicated operation and a low recognition rate.

Method used

A graphical user interface was developed on the MATLAB platform. The HMM algorithm was combined with MFCC features, and the DTW algorithm was used for speech signal matching and recognition. Automatic preprocessing and recognition of speech signals were implemented through the GUI design, and real-time data processing was performed using MATLAB's signal processing toolkit.

Benefits of technology

The Chinese digit speech recognition system achieves a recognition rate of over 90%, is easy to operate, has a clean and intuitive interface, and improves recognition efficiency and user experience.

✦ Generated by Eureka AI based on patent content.

Abstract

The application relates to the recognition of isolated speech signals using MATLAB. HMM is used as the main recognition algorithm and MFCC as the main speech feature parameter to establish a Chinese digit speech recognition system. DTW is used as a comparison algorithm, and its recognition results are compared with those of the HMM algorithm. Both approaches include speech signal preprocessing, feature parameter extraction, recognition template training, and a recognition matching algorithm. Furthermore, the speech recognition system interface is developed with the MATLAB graphical user interface tools; it is simple in design, convenient to use, and visually attractive. Recognition rate statistics show a clear recognition effect, with a recognition rate of more than 90% in the experimental environment.

Description

Technical Field

[0001] This invention relates to a method for recognizing isolated speech signals using MATLAB. A Chinese digit speech recognition system is established using Hidden Markov Models (HMM) as the primary recognition algorithm and MFCC as the main speech feature parameters. DTW is used as a comparison algorithm, and its recognition results are compared with those of the HMM algorithm. Both methods include speech signal preprocessing, feature parameter extraction, template training, and matching algorithms. Furthermore, a graphical user interface for the speech recognition system is developed using MATLAB; the interface is simple, easy to use, and aesthetically pleasing. Recognition rate statistics show a clear recognition effect, with a recognition rate of over 90% in experimental conditions.

Background Technology

[0002] Since the 1990s, significant progress has been made in model design, parameter extraction and optimization, and system adaptation techniques. Speech recognition technology has matured and entered the market. Research on speech recognition is currently accelerating worldwide, with the aim of strengthening market share. The research focus has shifted from discontinuous speech to general continuous speech, primarily because the latter carries more complete grammatical information, which enables the recognition of continuous sentences and best reflects natural human speech. However, the transition from discontinuous to continuous speech is technically complex: discontinuous speech recognition deals only with isolated sound wave segments, whereas continuous speech recognition must also decide how to segment the sound wave.

Summary of the Invention

[0003] System Overall Design: The design and development of a speech recognition system depends on a sound algorithm; only then can the system be developed stably. This system uses the numerical computing capabilities of MATLAB to recognize isolated speech signals. MATLAB is chosen mainly because it ships with a large number of signal processing tools, and its toolkit provides real-time feedback during data processing. The algorithms in the MATLAB toolkit are also relatively advanced. Compared to traditional implementation methods, this platform allows a speech recognition system to be developed through a graphical interface, so that the recognition process and results can be shown intuitively during development and design. The system employs the HMM (Hidden Markov Model) recognition algorithm combined with MFCC (Mel-frequency cepstral coefficient) speech feature parameters to construct a Chinese digit speech recognition system.

[0004] System Implementation - Implementation of the DTW Algorithm in MATLAB: First, define the matrices d and D, both of size n*m, which hold the frame matching distance and the cumulative distance, respectively; n and m are the numbers of frames in the test template and the reference template. Next, compute the frame matching distance matrix d for the two templates in a loop. Then carry out dynamic programming: for each grid point (i,j), compute the cumulative distances D1, D2, and D3 of its three feasible preceding grid points. Boundary cases need attention: if a preceding grid point does not exist, an additional check is required. Use the function min to take the smallest of the three cumulative distances and add it to the current frame matching distance d(i,j), which gives the cumulative distance of the current grid point. Repeat until grid point (n,m) is reached and output D(n,m), which is taken as the result of matching against the speech template library. Running the main function dtw.m makes MATLAB automatically read the WAV audio files from the corresponding folder, perform endpoint detection, calculate the Mel-frequency cepstral coefficients (MFCC), and save them as the reference and test libraries. The DTW algorithm then matches them and calculates their dist distances. The test results and the dist data can be inspected in the MATLAB command window. In the dist data, every diagonal entry is the minimum of its row, that is, each test utterance is closest to the reference template of the same digit, which confirms the correctness of the experiment.
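The dynamic-programming recursion described above can be sketched as follows. This is a minimal illustration in standard-library Python, not the patent's dtw.m. The text does not say which three preceding grid points are used or how the frame distance is defined, so the predecessors (i-1,j), (i-1,j-1), (i,j-1) and the Euclidean frame distance are assumptions, and the names `frame_distance`, `dtw_distance`, and `recognize` are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref):
    """Cumulative DTW distance D(n, m) between two frame sequences."""
    n, m = len(test), len(ref)
    # d[i][j]: frame matching distance; D[i][j]: cumulative distance.
    d = [[frame_distance(test[i], ref[j]) for j in range(m)] for i in range(n)]
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = d[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Boundary handling: a predecessor that does not exist
            # is treated as infinitely far away.
            d1 = D[i - 1][j] if i > 0 else math.inf
            d2 = D[i - 1][j - 1] if i > 0 and j > 0 else math.inf
            d3 = D[i][j - 1] if j > 0 else math.inf
            D[i][j] = d[i][j] + min(d1, d2, d3)
    return D[n - 1][m - 1]


def recognize(test, references):
    """Return the label whose reference template is closest to the test."""
    return min(references, key=lambda label: dtw_distance(test, references[label]))
```

Here `references` is assumed to be a dictionary mapping a digit label to its reference frame sequence. Calling `recognize` for every test utterance corresponds to picking the row minimum of the dist table mentioned above.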

[0005] System Implementation - Implementation of the HMM Algorithm in MATLAB: A recognition program based on the HMM algorithm requires many functions. This section explains each function in the order of the overall HMM implementation process. 1) Calculate the output probability of a Gaussian mixture component. 2) Calculate the output probability of the observation vector x for a given HMM state, that is, the linear combination of the output probabilities of x under the Gaussian mixture components of that state. This is mainly done by the code in `mixture.m`. 3) The MATLAB program `getparam.m` calculates various parameters. 4) The Viterbi algorithm is implemented in logarithmic form. The recognition program takes the HMM model parameters and the observation sequence of the test speech as input, then calculates the output probability of the sequence for the model and returns the optimal state path. 5) The Baum-Welch algorithm is used during training. The transition probabilities, and the mean, variance, and weight coefficients of each PDF, are calculated with the Baum-Welch re-estimation formulas. The function `baum.m` is a MATLAB implementation of one Baum-Welch iteration. It first calls the `getparam` function to calculate various parameters for each input observation sequence, then re-estimates all parameters of the HMM, and finally returns the new parameters as output variables. 6) Initialize the HMM parameters. Before the Baum-Welch iterations start, the HMM parameters are initialized, mainly by the function `initmm.m`. During initialization, the initial probability `hmm.init` is defined as an array whose first element is 1 and whose other elements are 0, with a length equal to the number of HMM states, N. Finally, the mean, variance, and weight coefficients of the Gaussian mixture function for each state are initialized.
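Steps 1), 2), and 4) above can be sketched as follows. This is an illustrative standard-library Python sketch, not the patent's `mixture.m` or `viterbi.m`. Diagonal covariances, the log-sum-exp step, and the names `log_gauss_diag`, `log_mixture`, and `viterbi_log` are assumptions made for the example.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture(x, weights, means, variances):
    """Log output probability of x for one state: log sum_k w_k N(x; mu_k, var_k)."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
        if w > 0.0
    ]
    peak = max(terms)
    # log-sum-exp keeps the sum numerically stable for tiny densities.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def _log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0.0 else -math.inf


def viterbi_log(obs, init, trans, states):
    """Best state path and its log probability, computed in the log domain.

    states[s] is a (weights, means, variances) tuple for state s.
    """
    n = len(init)
    delta = [_log(init[s]) + log_mixture(obs[0], *states[s]) for s in range(n)]
    back = []
    for x in obs[1:]:
        prev, delta, ptr = delta, [], []
        for s in range(n):
            # Most likely predecessor of state s at this frame.
            best = max(range(n), key=lambda r: prev[r] + _log(trans[r][s]))
            ptr.append(best)
            delta.append(prev[best] + _log(trans[best][s]) + log_mixture(x, *states[s]))
        back.append(ptr)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]
```

In a digit recognizer of this kind, the test utterance would be scored against the model of every digit and the digit with the largest returned log probability would be chosen.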
In this study, each observation sequence is divided into equal segments according to the number of states. The parameters that fall into the same segment across all observation sequences are then gathered into one large matrix. The k-means clustering function of the Voicebox toolbox is called on this matrix, and the mean, variance, and weight coefficients of each PDF are calculated from the clustering results. 7) Main program training. The function `baum.m` described above completes only one iteration; in practice, multiple iterations are needed to obtain good results, so a condition for ending the iteration must be set. In this study, the output probabilities of all observation sequences are calculated and accumulated into a total output probability. When the relative change in this total falls below a set threshold, the iteration ends automatically. In addition, a maximum number of iterations can be set as a constant; if the number of iterations exceeds this constant, the iteration ends. 8) In the program, the parameter input consists of two parts. The structure array `samples` contains information about the observation sequences. Each `samples(k)` has two members, `samples(k).wave` and `samples(k).data`, which hold the original speech and the parameters of the observation sequence, respectively. The value of `data` must be calculated before the call, or it can be calculated inside the `train` program. The array `M` gives the number of Gaussian mixture components for each state. With the training function `train.m` and the recognition function `viterbi.m`, observation sequences can be trained and recognized.
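The initialization and the stopping rule described above can be sketched as follows. This is an illustrative standard-library Python sketch, not the patent's `initmm.m` or `train.m`. The k-means step is omitted, the total is assumed to be a log probability, and the names `initial_probabilities`, `split_evenly`, `pool_segments`, `train`, and the callable `one_iteration` (standing in for one Baum-Welch pass) are hypothetical, as are the default `tol` and `max_iter` values.

```python
def initial_probabilities(n_states):
    """Initial state probabilities: first element 1, all others 0."""
    return [1.0] + [0.0] * (n_states - 1)


def split_evenly(frames, n_states):
    """Cut one observation sequence into n_states consecutive, near-equal segments."""
    total = len(frames)
    bounds = [k * total // n_states for k in range(n_states + 1)]
    return [frames[bounds[k]:bounds[k + 1]] for k in range(n_states)]


def pool_segments(sequences, n_states):
    """For each state, gather the frames of its segment from every sequence."""
    pooled = [[] for _ in range(n_states)]
    for frames in sequences:
        for k, segment in enumerate(split_evenly(frames, n_states)):
            pooled[k].extend(segment)
    return pooled


def train(model, sequences, one_iteration, tol=5e-6, max_iter=40):
    """Repeat one_iteration until the total log probability stops changing.

    one_iteration(model, sequences) must return (new_model, total_log_prob).
    Returns the final model and the number of iterations performed.
    """
    previous = None
    done = 0
    for done in range(1, max_iter + 1):
        model, total = one_iteration(model, sequences)
        # Stop when the relative change drops below the threshold.
        if previous is not None and abs(total - previous) <= tol * abs(previous):
            break
        previous = total
    return model, done
```

The pooled frames of each state would then be clustered (k-means in the source) to obtain the initial mean, variance, and weight of every mixture component before the first call to `train`.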
A script file `mian.m` in the program assumes that the original speech signals are already stored in the cell array `samples`, where `samples{i}{1:K}` holds the K speech signals of the i-th word. Inside a loop, these signals are loaded into the `samples` structure array and passed to the function `train` for training. The resulting `hmm` is a cell array in which each element is the `hmm` structure for one word.

Attached Figure Description

[0006] Figure 1 shows the interface layout stage of the GUI design, in which the interface controls are laid out.

[0007] Figure 2 shows the String property being set in the property editor.

[0008] Figure 3 shows the color of a control being adjusted.

[0009] Figure 4 shows each required control being configured in turn until all are complete.

[0010] Figure 5 shows recording being started by running voice_recg.m in MATLAB.

[0011] Figure 6 shows the recording being saved.

[0012] Figure 7 is a waveform diagram of the recorded speech.

[0013] Figure 8 shows endpoint detection on the recording.

[0014] Figure 9 shows the training codebook being clicked in order to use the HMM algorithm for recognition.

[0015] Figure 10 shows the HMM training process.

[0016] Figure 11 shows the files hmm.mat and file_09.mat, which hold the training results of the HMM.

[0017] Figure 12 shows the HMM recognition results.

[0018] Figure 13 shows the DTW algorithm used as the recognition algorithm, with the speech training codebook in use.

[0019] Figure 14 shows the DTW algorithm process.

[0020] Figure 15 shows the DTW recognition results.

Detailed Implementation

[0021] Because operating a computer system through a command-line interface is complex, graphical user interfaces (GUIs) have become more popular for their convenience. A GUI is a user interface that presents computer operations intuitively using graphics. With MATLAB's GUI tools, operation is very simple, speech processing is automated, and practical needs are met. In actual operation, the system can process real-time speech, automatically reading and preprocessing it and then performing speech conversion and recognition. MATLAB's GUIDE can be launched in several ways: 1) select "Open" in the toolbar; 2) select MATLAB > GUIDE (GUI Builder) in the Start menu; 3) select New > Graphical User Interface in the MATLAB HOME menu, then drag and drop objects to their destinations to quickly build the entire GUI; 4) type "guide" at the command prompt. This work primarily uses the third method.

[0022] The steps for developing and designing the GUI on the MATLAB platform are as follows. Step 1: Open the GUIDE panel, select the Blank GUI (Default) option, and click the OK button. A panel with a grid background appears, and all subsequent work is done on it; for example, the color, size, and visibility of each component can be set there. To create further components, use the column of component icons on the left side of the panel: click the corresponding button and place the component at the appropriate position. Step 2: Define the required objects, such as the text labels, buttons, and display boxes needed by the program. Drag and drop them to their positions, name them, and then adjust and connect their callbacks to obtain the desired system. The resulting interface provides voice recording, voice storage and retrieval, reading recordings from a folder, saving processed voice, displaying the voice spectrum, and voice recognition and conversion. The user first selects a voice file and opens it; the voice can then be displayed and processed through the corresponding functions, which include viewing the original spectrum, viewing the processed spectrum, real-time acquisition, and selecting files from a folder.

Claims

1. A speech recognition system based on MATLAB, characterized in that... include: MATLAB is used to achieve isolated speech signal recognition. A Chinese digit speech recognition system is established using Hidden Markov Models (HMM) as the primary recognition algorithm and MFCC as the main speech feature parameters. DTW is used as a comparison algorithm, and its recognition results are compared with those of the HMM algorithm. Both algorithms include speech signal preprocessing, feature parameter extraction, template training, and matching. Furthermore, a graphical user interface for the speech recognition system is developed using MATLAB. The interface is simple, easy to use, and aesthetically pleasing. Recognition rate statistics show a clear recognition effect, with a recognition rate of over 90% in experimental conditions.

2. The MATLAB-based speech recognition system according to claim 1, characterized in that... The DTW algorithm addresses isolated word speech recognition, effectively solving the matching problem for utterances of varying lengths. For recognizing single words under identical conditions, there is little difference in results between the DTW and HMM algorithms. However, the HMM algorithm is more cumbersome computationally, primarily in the training phase: it requires collecting a large amount of speech data and performing complex calculations to obtain the corresponding model parameters, whereas the DTW algorithm requires no additional training computation. The greatest advantage of DTW therefore lies in its convenience, and it can be applied effectively to isolated word recognition. It also has limitations: it cannot be trained into an effective framework using statistical methods, and it cannot readily use low-level and high-level knowledge for recognition, so it shows significant shortcomings compared to the HMM algorithm when dealing with large-vocabulary, continuous, and speaker-independent speech.

3. The MATLAB-based speech recognition system according to claim 1, characterized in that... The Hidden Markov Model (HMM) treats speech as a series of specific states. These states are not directly observable (e.g., a state feature could be a phoneme feature) and are implicitly associated with the observables (or features) of the speech. This implicit relationship is often expressed probabilistically in the HMM model, and the model's output is also given in probabilistic form. It unifies the acoustic and phonological layers in speech recognition algorithms, establishing optimal search and matching algorithms to combine information from the acoustic and phonological layers in a probabilistic manner.

4. The MATLAB-based speech recognition system according to claim 1, characterized in that... The mathematical computation capabilities of the MATLAB simulation system are used to achieve the goal of recognizing isolated speech signals. MATLAB simulation software includes a rich set of toolkits, enabling real-time feedback during data processing. It's worth noting that the MATLAB simulation toolkits utilize advanced algorithms. Compared to traditional implementation methods, this platform allows for the development of speech recognition systems through a graphical interface. Furthermore, the speech recognition process and results can be visually demonstrated during the development and design process.

5. The MATLAB-based speech recognition system according to claim 1, characterized in that... To implement the DTW algorithm in MATLAB, the matrices are first defined as follows: d and D are n*m matrices holding the frame matching distance and the cumulative distance, respectively, while n and m are the numbers of frames in the test template and the reference template, respectively. Next, the frame matching distance matrix d between the two templates is calculated iteratively. After these steps, dynamic programming is carried out: for each grid point (i,j), the cumulative distances D1, D2, and D3 of its three feasible preceding grid points are computed. Boundary conditions must be considered: where a preceding grid point does not exist, an additional check is required. The minimum function min is then used to take the smallest of the cumulative distances and add it to the current frame matching distance d(i,j), which determines the cumulative distance of the current grid point. These steps are repeated until grid point (n,m), at which point the value of D(n,m) is output and taken as the result of matching against the speech signal template library. By running the main function dtw.m, MATLAB automatically reads the WAV format audio files from the corresponding folder, performs endpoint detection, calculates the Mel-frequency cepstral coefficients (MFCC), and saves them as reference and test libraries. Then, the DTW algorithm is used to match the audio files and calculate their dist distance.

6. The MATLAB-based speech recognition system according to claim 1, characterized in that... Implementing the HMM algorithm in MATLAB begins with initializing the HMM parameters, primarily using the function `initmm.m`. Initialization involves determining the initial probability `hmm.init`, defining it as an array whose first element is 1 and all others are 0, with a length equal to the number of HMM states, N. Finally, the mean, variance, and weight coefficients of the Gaussian mixture function for each state are initialized. In this study, each observation sequence is divided into equal segments according to the number of states, and the parameters belonging to the same segment in all observation sequences are gathered into one large matrix. The k-means clustering function of the Voicebox toolbox is then called to calculate the mean, variance, and weight coefficients of each PDF from the clustering results. The parameter input consists of two parts: the `samples` array contains information about the observation sequences. Each `samples(k)` includes two members, `samples(k).wave` and `samples(k).data`, which hold the original speech and the parameters of the observation sequence, respectively. The value of `data` must be calculated before the call, or it can be calculated in the `train` program. The array `M` gives the number of Gaussian mixture components for each state.