Method, apparatus and computer device for identifying voice and storage medium
A speech recognition technology, applied in the fields of speech recognition, speech analysis, neural learning methods, etc., which solves problems such as insufficient speech recognition accuracy and achieves the effects of improving recognition speed and improving recognition accuracy
Active Publication Date: 2018-01-26
PING AN TECH (SHENZHEN) CO LTD
Cites: 21 · Cited by: 13
AI-Extracted Technical Summary
Problems solved by technology
[0004] Based on this, it is necessary to address the problem of insufficient speech recognition accuracy, the pr...
Abstract
The invention provides a method for identifying voice. The method includes the following steps: acquiring to-be-identified voice data; extracting the Filter Bank feature and the MFCC feature in the voice data; using the MFCC feature as the input data of a GMM-HMM model to acquire a first likelihood probability matrix; using the Filter Bank feature as the input feature of an LSTM model having a connecting unit to acquire a posterior probability matrix; using the posterior probability matrix and the first likelihood probability matrix as the input data of an HMM model to acquire a second likelihood probability matrix; and acquiring, based on the second likelihood probability matrix, the corresponding target word sequence from a phoneme decoding network. The method combines a mixed Gaussian model with a deep learning model and uses the LSTM model having the connecting unit as the acoustic model, thereby increasing the accuracy of voice identification. The invention further provides an apparatus for identifying voice, a computer device and a storage medium.
Examples
- Experimental program(1)
Example Embodiment
[0046] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.
[0047] Figure 1 is a schematic diagram of the internal structure of a computer device in one embodiment. The computer device may be a terminal or a server. Referring to figure 1, the computer device includes a processor, a non-volatile storage medium, an internal memory, a network interface, a display screen and an input device connected through a system bus. The non-volatile storage medium of the computer device may store an operating system and computer-readable instructions, and when the computer-readable instructions are executed, the processor may execute a voice recognition method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. Computer-readable instructions may be stored in the internal memory, and when they are executed by the processor, the processor may execute a voice recognition method. The network interface of the computer device is used for network communication. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a button, a trackball or a touch pad provided on the casing of the computer device, or an external keyboard, touchpad or mouse. The touch layer and the display screen form a touch screen. Those skilled in the art can understand that the structure shown in figure 1 is only a block diagram of the part of the structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution of this application is applied. A specific computer device may include more or fewer parts than shown in the figure, combine certain parts, or have a different arrangement of parts.
[0048] First, the framework of speech recognition is introduced. As shown in figure 2, speech recognition mainly includes two parts, an acoustic model and a language model, which are combined with a dictionary to form the speech recognition framework. The process of speech recognition is the process of converting the input speech feature sequence into a character sequence according to the dictionary, the acoustic model and the language model. Among them, the role of the acoustic model is to obtain the mapping between speech features and phonemes, the role of the language model is to obtain the mapping between words and between words and sentences, and the role of the dictionary is to obtain the mapping between words and phonemes. The specific speech recognition process can be divided into three steps. The first step is to recognize speech frames as phoneme states, that is, to align the speech frames with the phoneme states. The second step is to combine states into phonemes. The third step is to combine phonemes into words. The first step is the role of the acoustic model; it is the key point and also the difficulty: the more accurate the alignment between speech frames and phoneme states, the better the speech recognition effect. A phoneme state is a more fine-grained phonetic unit than a phoneme, and a phoneme is usually composed of three phoneme states.
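For illustration only, the following minimal Python sketch shows how these mappings fit together; the dictionary entry and phoneme symbols are hypothetical examples, and only the three-states-per-phoneme convention comes from the text above:

```python
# Hypothetical lexicon entry: the dictionary maps a word to its phoneme sequence.
lexicon = {"hello": ["HH", "AH", "L", "OW"]}
STATES_PER_PHONEME = 3  # a phoneme is usually composed of three phoneme states

def phoneme_states(word):
    """Expand a word into the phoneme states that step 1 aligns frames against;
    steps 2 and 3 then combine states back into phonemes and phonemes into words."""
    return [(ph, s) for ph in lexicon[word] for s in range(STATES_PER_PHONEME)]

print(phoneme_states("hello"))  # [('HH', 0), ('HH', 1), ('HH', 2), ('AH', 0), ...]
```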
[0049] As shown in figure 3, in one embodiment, a speech recognition method is proposed. The method can be applied to a terminal or a server, and specifically includes the following steps:
[0050] Step 302, acquiring voice data to be recognized.
[0051] In this embodiment, the voice data to be recognized is usually audio data input by the user and obtained through an interactive application, including digital audio and text audio.
[0052] Step 304, extracting Filter Bank features and MFCC features in the voice data.
[0053] In this embodiment, the Filter Bank (filter bank) feature and the MFCC (Mel frequency cepstrum coefficient) feature are both parameters used to represent speech features in speech recognition. Among them, the Filter Bank feature is used for deep learning models, and the MFCC feature is used for mixed Gaussian models. Before extracting the Filter Bank features and MFCC features from the voice data, it is generally necessary to preprocess the voice data. Specifically, the input speech data is first pre-emphasized: a high-pass filter is used to enhance the high-frequency part of the speech signal and flatten the spectrum. The pre-emphasized speech data is then framed and windowed, so that the non-stationary speech signal is converted into short-term stationary signals, after which speech is distinguished from noise through endpoint detection and the effective speech part is extracted. To extract the Filter Bank features and MFCC features from the speech data, the preprocessed speech data is first subjected to a fast Fourier transform, so that the speech signal in the time domain is converted into an energy spectrum in the frequency domain for analysis. The energy spectrum is then passed through a set of Mel-scale triangular filter banks to highlight the formant features of speech, and the logarithmic energy output by each filter bank is calculated; the features output by the filter banks are the Filter Bank features. Further, the calculated logarithmic energy is subjected to a discrete cosine transform to obtain the MFCC coefficients, that is, the MFCC features.
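The preprocessing and feature extraction pipeline just described can be sketched end to end as follows; the sampling rate, frame sizes, pre-emphasis coefficient and the placeholder mel filter bank are illustrative assumptions rather than values from the patent, and the signal is assumed to be at least one frame long:

```python
# Minimal sketch of the Filter Bank / MFCC extraction pipeline described above.
# Assumed parameters: 16 kHz audio, 25 ms frames with 10 ms shift, Hamming window.
import numpy as np
from scipy.fftpack import dct

def fbank_and_mfcc(signal, sample_rate=16000, n_fft=512, n_mels=40, n_mfcc=13,
                   mel_fb=None):
    # 1. Pre-emphasis: boost the high-frequency part with a first-order high-pass filter.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2. Framing and windowing: convert to short-term stationary segments.
    frame_len, frame_step = int(0.025 * sample_rate), int(0.010 * sample_rate)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_step)
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)

    # 3. Fast Fourier transform -> power (energy) spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 4. Mel-scale triangular filter bank + log energy -> Filter Bank features.
    if mel_fb is None:  # placeholder bank; a real mel bank would be precomputed
        mel_fb = np.ones((n_mels, n_fft // 2 + 1)) / (n_fft // 2 + 1)
    fbank = np.log(power @ mel_fb.T + 1e-10)

    # 5. Discrete cosine transform of the log energies -> MFCC features.
    mfcc = dct(fbank, type=2, axis=1, norm='ortho')[:, :n_mfcc]
    return fbank, mfcc
```

The Filter Bank features feed the deep learning model, while the MFCC features (their decorrelated DCT) feed the mixed Gaussian model, matching the division of labor described above.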
[0054] Step 306, using the MFCC features as input data of the trained GMM-HMM model, and obtaining the first likelihood probability matrix output by the trained GMM-HMM model.
[0055] In this embodiment, the acoustic model and the language model jointly implement speech recognition. Among them, the role of the acoustic model is to identify the alignment relationship between speech frames and phoneme states. The GMM-HMM model is part of the acoustic model and is used to initially align speech frames with phoneme states. Specifically, the MFCC features extracted from the speech data to be recognized are used as the input data of the trained GMM-HMM model, and the likelihood probability matrix output by the model, referred to as the "first likelihood probability matrix", is obtained. The likelihood probability matrix represents the alignment relationship between speech frames and phoneme states, that is, this alignment relationship can be obtained from the calculated likelihood probability matrix. However, the alignment obtained through GMM-HMM training is not very accurate, so the first likelihood probability matrix here amounts to a preliminary alignment of the speech frames and the phoneme states. The specific calculation formula of the GMM model is as follows:
[0056] $$N(x;\mu,D)=\frac{1}{(2\pi)^{K/2}\,|D|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{T}D^{-1}(x-\mu)\right)$$
[0057] Among them, x represents the extracted speech feature (MFCC) vector, μ and D are respectively the mean vector and the variance matrix, and K represents the order of the MFCC coefficients, i.e. the dimension of x.
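As a hedged sketch of how the first likelihood probability matrix could be assembled from this formula, the code below evaluates the diagonal-covariance Gaussian log-density of each MFCC frame against each phoneme state; the frame count, state count and parameter values are random placeholders, not values from the patent:

```python
# Sketch: build a (frames x states) log-likelihood matrix from per-state Gaussians.
# Means/variances would come from the trained GMM-HMM; placeholders are used here.
import numpy as np

def log_gaussian(x, mu, var):
    """log N(x; mu, diag(var)) for a K-dimensional feature vector x."""
    k = x.shape[-1]
    return -0.5 * (k * np.log(2 * np.pi) + np.sum(np.log(var))
                   + np.sum((x - mu) ** 2 / var))

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))   # 100 frames of 13-dim MFCC features (assumed)
means  = rng.normal(size=(48, 13))    # 48 phoneme states (placeholder count)
varis  = np.ones((48, 13))            # per-state diagonal variances

likelihood_matrix = np.array([[log_gaussian(f, m, v)
                               for m, v in zip(means, varis)]
                              for f in frames])  # shape (100, 48)
```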
[0058] Step 308, using the Filter Bank feature as the input feature of the trained LSTM model with connected units, and obtaining the posterior probability matrix output by the LSTM model with connected units, where the connected units are used to control the flow of information between layers in the LSTM model.
[0059] In this embodiment, the LSTM model belongs to the deep learning models and is also a part of the acoustic model. The LSTM with connection units is an innovative model proposed on the basis of the traditional LSTM model. This model adds a connection unit between the layers of the traditional LSTM model; the connection unit can control the flow of information between layers, so effective information can be screened through it, and it allows the LSTM model to be trained with deeper levels. The more layers there are, the better the feature expression obtained and the better the recognition effect. Therefore, the LSTM model with connected units can not only improve the speed of speech recognition but also improve its accuracy. Specifically, the connection unit is realized by a sigmoid function. The principle is to pass the output of the previous layer through a threshold formed by the sigmoid function to control the information flowing into the next layer, that is, the gated output is used as the input of the next LSTM layer. The value of this sigmoid function is determined by the state of the neuron node of the previous layer, the output of the neuron node of the previous layer and the input of the neuron node of the next layer. Here, the neuron nodes are responsible for the calculations of the neural network model; each node contains certain calculation relations, which can be understood as calculation formulas and may be the same or different. The number of neuron nodes in each layer of the LSTM is determined by the number of frames and the feature vector of the input features. For example, if the input is spliced with 5 frames before and after, then there are 11 frames of input vectors in total, and the feature vector corresponding to each frame is determined by the extracted speech features. For example, if the extracted Filter Bank feature is an 83-dimensional feature vector, then the number of neuron nodes in each layer of the corresponding trained LSTM model is 11 × 83 = 913.
[0060] Step 310, using the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model, and obtaining an output second likelihood probability matrix.
[0061] In this embodiment, the HMM (hidden Markov) model is a statistical model used to describe a Markov process containing hidden, unknown parameters; its role is to determine the hidden parameters of the process from the observable parameters. The HMM model mainly involves 5 parameters, namely 2 state sets and 3 probability sets. The two state sets are the hidden states and the observed states, and the three probability sets are the initial matrix, the transition matrix and the confusion matrix. The transition matrix is obtained through training, that is, once the HMM model training is completed, the transition matrix is determined. In this embodiment, observable speech features (Filter Bank features) are mainly used as observation states to calculate and determine the correspondence between phoneme states and speech frames (i.e., the hidden states). To determine this correspondence, two parameters still need to be determined, namely the initial matrix and the confusion matrix. Among them, the posterior probability matrix calculated by the LSTM model with connected units serves as the confusion matrix to be determined in the HMM model, and the first likelihood probability matrix serves as the initial matrix to be determined. Therefore, by using the posterior probability matrix and the first likelihood probability matrix as the input data of the trained HMM model, the output second likelihood probability matrix can be obtained. The second likelihood probability matrix represents the final alignment relationship between phoneme states and speech frames. Subsequently, according to the determined second likelihood probability matrix, the target word sequence corresponding to the speech data to be recognized can be obtained in the phoneme decoding network.
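A minimal sketch of one plausible reading of this composition, under stated assumptions: the LSTM's posterior matrix supplies the per-frame emission (confusion) scores, the first likelihood probability matrix supplies the initial state scores, and the trained transition matrix is fixed; a standard log-domain Viterbi pass then recovers a frame-to-state alignment. The shapes and the log-domain formulation are illustrative choices, not taken from the patent:

```python
# Sketch: Viterbi decoding over an HMM assembled from initial scores, a trained
# transition matrix, and per-frame emission (confusion) scores.
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,), log_trans: (S, S), log_emit: (T, S) -> best state path."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]          # best score ending in each state at t=0
    back = np.zeros((T, S), dtype=int)      # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans # (S, S): previous state -> next state
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):           # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

For instance, `viterbi(first_like[0], np.log(trans + 1e-10), np.log(post + 1e-10))` would return one frame-to-state path; treating the first row of the first likelihood matrix as the initial scores is an assumption of this sketch.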
[0062] Step 312: Acquire the target word sequence corresponding to the speech data to be recognized in the phoneme decoding network according to the second likelihood probability matrix.
[0063] In this embodiment, the speech recognition process includes two parts, one being the acoustic model and the other the language model. Before speech recognition, it is first necessary to build a phoneme-level decoding network based on the trained acoustic model, language model and dictionary, and to find the best path in this network with a search algorithm; the search algorithm can use the Viterbi algorithm. This path is the one that outputs, with the greatest probability, the word string corresponding to the voice data to be recognized, so that the text contained in the voice data is determined. The decoding network at the phoneme level (that is, the phoneme decoding network) is built through finite state transducer (Finite State Transducer, FST) related algorithms, such as the determinization and minimization algorithms: sentences are split into words, the words are split into phonemes (such as Chinese initials and finals, or English phonetic symbols), and alignment calculations are then performed on the phonemes, pronunciation dictionary, grammar, etc. by the above method to obtain the output phoneme decoding network. The phoneme decoding network contains all possible recognition path expressions. The decoding process prunes the paths of this huge network according to the input voice data to obtain one or more candidate paths, which are stored in a word-lattice data structure; the final recognition then scores the candidate paths, and the path with the highest score is the recognition result.
[0064] In this embodiment, speech recognition is carried out by combining the mixed Gaussian model GMM with the long short-term memory recurrent neural network LSTM of the deep learning models. First, the GMM-HMM model is used to calculate the first likelihood probability matrix from the extracted MFCC features; the first likelihood probability matrix represents the alignment result of the speech data on the phoneme states. Then the LSTM is used to perform further alignment on the basis of this preliminary alignment result. The LSTM used is an innovative LSTM model with connected units: the model adds a connection unit between the layers of the traditional LSTM model, this connection unit can control the information flow between layers, and effective information can be screened through it, which can improve both the speed of recognition and the accuracy of recognition.
[0065] As shown in Figure 4, in one embodiment, the connection unit is a sigmoid function. The step of using the Filter Bank feature as the input feature of the trained LSTM model with connection units to obtain the posterior probability matrix output by the LSTM model with connection units, where the connection unit is used to control the flow of information between layers in the LSTM model, includes:
[0066] Step 308a, using the Filter Bank feature as the input feature of the trained LSTM model with connected units.

[0067] Step 308b, determining, according to the state and output of the neuron node of the previous layer in the LSTM model and the input of the neuron node of the next layer, the sigmoid function value corresponding to the connection unit between the layers.

[0068] Step 308c, outputting, according to the sigmoid function value corresponding to the connection unit between the layers, the posterior probability matrix corresponding to the Filter Bank feature.
[0069] In this embodiment, the connection unit is implemented with a sigmoid function. In the LSTM model, the sigmoid function is used to control the flow of information from layer to layer, for example, whether information flows and how much of it flows. The function value of the sigmoid is determined by the state of the neuron node of the previous layer, the output of the neuron node of the previous layer and the input of the neuron node of the next layer. Specifically, the sigmoid function is expressed as $\sigma(x)=1/(1+e^{-x})$, and the output of the connection unit takes the form

$$d_{t}^{l+1}=\sigma\left(W_{x}x_{t}^{l+1}+W_{c}\odot c_{t}^{l}+W_{l}\odot h_{t}^{l}+b\right)$$

where $x_{t}^{l+1}$ represents the input of the connection unit at this layer at time $t$, $d$ represents the output of the connection unit, $l$ denotes the layer below the connection unit and $l+1$ the layer above it, $b$ represents the bias term, and $W$ represents a weight matrix: $W_{x}$ is the weight matrix associated with the input, $W_{c}$ the weight matrix associated with the output state, and $W_{l}$ the weight matrix related to the layer. Here $c$ represents the cell output controlled by the LSTM output gate; the LSTM has three gate controllers, namely the input gate, the forget gate and the output gate, and the role of the output gate is to control the output flow of the neuron node. $\odot$ is the operator that multiplies corresponding elements of two matrices, and $h_{t}^{l}$ denotes the output of the neuron node of layer $l$. The values of the bias term $b$ and the weight matrices $W$ are fixed once the model has been trained, so the amount of information flowing between layers can be determined from the input. Once the information flow between layers is determined, the posterior probability matrix corresponding to the Filter Bank feature can be obtained as output.
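A minimal numerical sketch of such a connection unit, assuming the gate form written above; the shapes and the use of per-unit (elementwise) weights for the state and output terms are implementation assumptions:

```python
# Sketch of a connection unit between LSTM layers, assuming the gate form
# d = sigmoid(Wx @ x_next + Wc * c_prev + Wl * h_prev + b) from the text above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def connection_unit(x_next, h_prev, c_prev, Wx, Wc, Wl, b):
    """Return the gated output passed from layer l to layer l+1.

    x_next: input of the next layer, shape (H,)
    h_prev, c_prev: output and state of the previous layer's nodes, shape (H,)
    Wx: (H, H) input weight matrix; Wc, Wl: (H,) per-unit weights; b: (H,) bias
    """
    d = sigmoid(Wx @ x_next + Wc * c_prev + Wl * h_prev + b)  # gate value in (0, 1)
    return d * h_prev  # elementwise product: how much information flows upward

# Example: a large positive bias opens the gate, passing h_prev almost unchanged.
H = 4
out = connection_unit(np.zeros(H), np.ones(H), np.zeros(H),
                      np.zeros((H, H)), np.zeros(H), np.zeros(H), 10 * np.ones(H))
```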
[0070] As shown in Figure 5, in one embodiment, the step 304 of extracting the Filter Bank feature and the MFCC feature from the voice data includes:
[0071] Step 304A, performing a Fourier transform to convert the voice data to be recognized into an energy spectrum in the frequency domain.
[0072] In this embodiment, since the characteristics of a speech signal are usually difficult to see from its variation in the time domain, the signal usually needs to be converted into an energy distribution in the frequency domain for observation, where different energy distributions represent different speech characteristics. Therefore, it is necessary to perform a fast Fourier transform on the speech data to be recognized to obtain the energy distribution on the frequency spectrum. The spectrum of each frame is obtained by performing a fast Fourier transform on each frame of the speech signal, and the power spectrum (i.e., the energy spectrum) of the speech signal is obtained by taking the modulus square of the spectrum.
[0073] In step 304B, the energy spectrum in the frequency domain is used as the input feature of the Mel-scale triangular filter bank, and the Filter Bank feature of the speech data to be recognized is calculated.
[0074] In this embodiment, in order to obtain the Filter Bank feature of the speech data to be recognized, the obtained energy spectrum in the frequency domain is used as the input feature of the Mel-scale triangular filter banks, and the logarithmic energy output by each triangular filter bank is calculated, which yields the Filter Bank feature of the speech data to be recognized. Specifically, the energy spectrum corresponding to each frame of the speech signal is used as the input feature of the Mel-scale triangular filter banks, and the Filter Bank feature corresponding to each frame of the speech signal is obtained.
[0075] In step 304C, the Filter Bank feature is subjected to discrete cosine transform to obtain the MFCC feature of the speech data to be recognized.
[0076] In this embodiment, in order to obtain the MFCC features of the speech data to be recognized, it is also necessary to perform discrete cosine transform on the logarithmic energy outputted through the filter bank to obtain the corresponding MFCC features. The MFCC feature corresponding to each frame of speech signal is obtained by discrete cosine transforming the Filter Bank feature corresponding to each frame of speech signal. Among them, the difference between the FilterBank feature and the MFCC feature is that the Filter Bank feature has data correlation between different feature dimensions, while the MFCC feature is a feature obtained by using discrete cosine transform to remove the data correlation of the Filter Bank feature.
[0077] As shown in Figure 6, in one embodiment, the step 308 of using the Filter Bank feature as the input feature of the trained LSTM model with connected units to obtain the posterior probability matrix output by the LSTM model with connected units, where the connected unit is used to control the flow of information between layers in the LSTM model, includes:
[0078] Step 308A, acquire the Filter Bank features corresponding to each frame of voice data in the voice data to be recognized and sort them by time.
[0079] In this embodiment, when extracting the Filter Bank features from the voice data to be recognized, the voice data is first divided into frames, and then the Filter Bank feature corresponding to each frame of voice data is extracted and sorted in order of time, that is, the Filter Bank features of the corresponding frames are sorted according to the order in which the frames appear in the voice data to be recognized.
[0080] Step 308B, using each frame of voice data together with the Filter Bank features of the preset number of frames before and after that frame as the input features of the trained LSTM model with connection units, controlling the flow of information between layers through the connection units, and obtaining the output posterior probability on the phoneme states corresponding to each frame of speech data.
[0081] In this embodiment, the input of the deep learning model uses multi-frame features, which is more advantageous than the traditional mixed Gaussian model with only single-frame input, because splicing the speech frames before and after the current speech frame helps capture the influence of context-related information. Therefore, each frame of speech data and the Filter Bank features of the preset number of frames before and after it are generally used as the input features of the trained LSTM model with connected units. For example, the current frame is spliced with the 5 frames before and after it, and a total of 11 frames of data are used as the input features of the trained LSTM model with connected units, as sketched below; for each frame of speech data, the posterior probability on the corresponding phoneme states is output.
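A hedged sketch of this context splicing; the edge padding policy (repeating the first and last frames) is an assumption not stated in the text:

```python
# Sketch: concatenate each frame's Filter Bank feature with the features of the
# `context` frames before and after it (2 * context + 1 frames in total).
import numpy as np

def splice(fbank, context=5):
    """fbank: (T, D) -> spliced: (T, (2*context+1)*D)."""
    T, D = fbank.shape
    padded = np.concatenate([np.repeat(fbank[:1], context, axis=0),  # assumed padding
                             fbank,
                             np.repeat(fbank[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

spliced = splice(np.random.randn(100, 83))  # shape (100, 913)
```

With 83-dimensional Filter Bank features and a context of 5, each spliced vector has 11 × 83 = 913 dimensions, matching the per-layer node count mentioned earlier.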
[0082] Step 308C: Determine the posterior probability matrix corresponding to the speech data to be recognized according to the posterior probability corresponding to each frame of speech data.
[0083] In this embodiment, after the posterior probability corresponding to each frame of speech data is obtained, the posterior probability matrix corresponding to the speech data to be recognized is determined; the posterior probability matrix is composed of the individual posterior probabilities. Since the LSTM model with connection units can contain both time-dimension information and layer-dimension information, compared with earlier traditional models that only carry time-dimension information, this model can better obtain the posterior probability matrix corresponding to the speech data to be recognized.
[0084] As shown in Figure 7, in one embodiment, before the step of acquiring the speech data to be recognized, the method also includes: Step 301, establishing the GMM-HMM model and the LSTM model with connected units. This specifically includes:
[0085] Step 301A, using the training corpus to train the Gaussian mixture models GMM and HMM, determine the variance and mean value corresponding to the GMM model through continuous iterative training, and generate the trained GMM-HMM model according to the variance and mean value.
[0086] In this embodiment, the GMM-HMM acoustic model is established with monophone training and triphone training in sequence, wherein triphone training takes into account the influence of the phonemes before and after the current phoneme and can obtain a more accurate alignment effect, that is, it can produce better recognition results. According to the features and functions used, triphone training generally adopts triphone training based on delta+delta-delta features and triphone training based on linear discriminant analysis plus maximum likelihood linear feature transformation. Specifically, the speech features in the input training corpus are first normalized (by default, variance normalization is applied; a normalization sketch follows). The normalization of speech features is meant to eliminate the deviation caused in feature extraction calculations by convolutional noise such as telephone channels. Then a small amount of feature data is used to quickly obtain an initialized GMM-HMM model, and the variance and mean corresponding to the mixed Gaussian model GMM are determined through continuous iterative training. Once the variance and mean are determined, the corresponding GMM-HMM model is determined accordingly.
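A minimal sketch of the per-utterance feature normalization mentioned above (mean and, by default, variance normalization); whether normalization is applied per utterance or per speaker is an assumption here:

```python
# Sketch: cepstral mean and variance normalization over one utterance's features.
import numpy as np

def cmvn(features, norm_var=True):
    """features: (T, D) speech features -> zero-mean (and unit-variance) features."""
    centered = features - features.mean(axis=0)   # remove the per-dimension mean
    if norm_var:
        centered /= features.std(axis=0) + 1e-10  # normalize the per-dimension variance
    return centered
```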
[0087] In step 301B, according to the MFCC features extracted from the training corpus, the likelihood probability matrix corresponding to the training corpus is obtained by using the trained GMM-HMM model.
[0088] In this embodiment, the speech data in the training corpus is used for training: the MFCC features of the speech in the training corpus are extracted and then used as the input features of the GMM-HMM model trained above, and the output likelihood probability matrix corresponding to the speech in the training corpus is obtained. The likelihood probability matrix represents the alignment relationship between speech frames and phoneme states. The purpose of outputting the likelihood probability matrix through the trained GMM-HMM is to use it as the initial alignment relationship for the subsequent training of the deep learning model, so that the subsequent deep learning model can obtain better deep learning results.
[0089] Step 301C, training the LSTM model with connected units according to the Filter Bank features and the likelihood probability matrix extracted from the training corpus, determining the weight matrices and bias matrices corresponding to the LSTM model with connected units, and generating the trained LSTM model with connected units according to the weight matrices and bias matrices.
[0090] In this embodiment, the above alignment result calculated by the GMM-HMM (that is, the likelihood probability matrix) and the original speech features are used as the input features for training the LSTM model with connected units. The original speech feature used here is the Filter Bank feature, which, compared with the MFCC feature, retains data correlation and therefore has better speech feature expression. By training the LSTM model with connected units, the weight matrix and bias matrix corresponding to each LSTM layer are determined. Specifically, the LSTM with connection units is also one of the deep neural network models, and neural network layers are generally divided into three categories, namely the input layer, the hidden layers and the output layer, wherein there are multiple hidden layers. The purpose of training the LSTM model with connected units is to determine all the weight matrices and bias matrices in each layer and the corresponding number of layers. The training can use existing algorithms such as the forward propagation algorithm and the Viterbi algorithm; the specific training algorithm is not limited here.
[0091] As shown in Figure 8, in one embodiment, a speech recognition method is proposed, and the method comprises the following steps:
[0092] Step 802, acquire voice data to be recognized.
[0093] Step 804, extracting Filter Bank features and MFCC features in the voice data.
[0094] Step 806, using the MFCC features as input data of the trained GMM-HMM model, and obtaining the first likelihood probability matrix output by the trained GMM-HMM model.
[0095] Step 808, using the Filter Bank feature and the first likelihood probability matrix as input data of the trained DNN-HMM model, and obtaining the second likelihood probability matrix output by the trained DNN-HMM model.
[0096] Step 810, using the Filter Bank feature as the input feature of the trained LSTM model with connected units, and obtaining the posterior probability matrix output by the LSTM model with connected units, the connected units are used to control the flow of information between layers in the LSTM model .
[0097] Step 812, using the posterior probability matrix and the second likelihood probability matrix as the input data of the trained HMM model, and obtaining the output third likelihood probability matrix.
[0098] Step 814: Obtain the target word sequence corresponding to the speech data to be recognized in the phoneme decoding network according to the third likelihood probability matrix.
[0099] In this embodiment, in order to obtain a more accurate recognition effect, the preliminary alignment result (the first likelihood probability matrix) is obtained through the trained GMM-HMM model, and then the trained DNN-HMM is used for further alignment, which can achieve a better alignment effect. Since the deep neural network model can obtain better speech feature expression than the traditional mixed Gaussian model, further forced alignment using the deep neural network model can further improve the accuracy rate. The result of the further alignment (the second likelihood probability matrix) is then substituted into the innovative LSTM-HMM model with connected units, and the final alignment result (the third likelihood probability matrix) is obtained. It should be noted that the alignment results here refer to the alignment relationship between speech frames and phoneme states. The above mixed Gaussian models and deep learning models are all parts of the acoustic model, and the role of the acoustic model is to obtain the alignment relationship between speech frames and phoneme states, which facilitates the subsequent combination with the language model in the phoneme decoding network to obtain the target word sequence corresponding to the speech data to be recognized.
[0100] As shown in Figure 9, in one embodiment, a speech recognition device is proposed, and the device comprises:
[0101] An acquisition module 902, configured to acquire speech data to be recognized.
[0102] An extraction module 904, configured to extract Filter Bank features and MFCC features in the speech data.
[0103] The first output module 906 is configured to use the MFCC features as input data of the trained GMM-HMM model, and obtain a first likelihood probability matrix output by the trained GMM-HMM model.
[0104] A posterior probability matrix output module 908, configured to use the Filter Bank feature as the input feature of the trained LSTM model with connected units and obtain the posterior probability matrix output by the LSTM model with connected units, where the connected units are used to control the information flow between layers in the LSTM model.
[0105] The second output module 910 is configured to use the posterior probability matrix and the first likelihood probability matrix as input data of the trained HMM model, and obtain an output second likelihood probability matrix.
[0106] The decoding module 912 is configured to obtain the target word sequence corresponding to the speech data to be recognized in the phoneme decoding network according to the second likelihood probability matrix.
[0107] In one embodiment, the extraction module is also configured to perform a Fourier transform to convert the speech data to be recognized into an energy spectrum in the frequency domain, use the energy spectrum in the frequency domain as the input feature of the Mel-scale triangular filter banks to calculate the Filter Bank feature of the voice data to be recognized, and perform a discrete cosine transform on the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.
[0108] In one embodiment, the connection unit is a sigmoid function; the posterior probability matrix output module 908 is also used to use the Filter Bank feature as the input feature of the trained LSTM model with the connection unit; according to the LSTM The state and output of the previous layer of neuron nodes in the model and the input of the next layer of neuron nodes determine the sigmoid function value corresponding to the connection unit between the layers; according to the corresponding The value of the sigmoid function, output the posterior probability matrix corresponding to the Filter Bank feature.
[0109] As shown in Figure 10, in one embodiment, the posterior probability matrix output module 908 includes:
[0110] The sorting module 908A is configured to acquire the Filter Bank features corresponding to each frame of voice data in the voice data to be recognized and sort them according to time.
[0111] A posterior probability output module 908B, configured to use each frame of voice data together with the Filter Bank features of the preset number of frames before and after that frame as the input features of the trained LSTM model with connection units, control the information flow between layers through the connection units, and obtain the output posterior probability on the phoneme states corresponding to each frame of speech data.
[0112] The determination module 908C is configured to determine the posterior probability matrix corresponding to the speech data to be recognized according to the posterior probability corresponding to each frame of speech data.
[0113] As shown in Figure 11, in one embodiment, the above speech recognition device also includes:
[0114] The GMM-HMM model training module 914 is used to train the Gaussian mixture model GMM and HMM using the training corpus, determine the variance and mean value corresponding to the GMM model through continuous iterative training, and generate the trained GMM-HMM model according to the variance and mean value.
[0115] The likelihood probability matrix acquisition module 916 is configured to acquire the likelihood probability matrix corresponding to the training corpus by using the trained GMM-HMM model according to the MFCC features extracted from the training corpus.
[0116] An LSTM model training module 918, configured to train the LSTM model with connected units according to the Filter Bank features and the likelihood probability matrix extracted from the training corpus, determine the weight matrices and bias matrices corresponding to the LSTM model with connected units, and generate the trained LSTM model with connected units according to the weight matrices and bias matrices.
[0117] As shown in Figure 12, in one embodiment, a speech recognition device is proposed, and the device comprises:
[0118] An acquisition module 1202, configured to acquire speech data to be recognized.
[0119] An extraction module 1204, configured to extract Filter Bank features and MFCC features in the speech data.
[0120] The first output module 1206 is configured to use the MFCC features as input data of the trained GMM-HMM model, and obtain a first likelihood probability matrix output by the trained GMM-HMM model.
[0121] A second output module 1208, configured to use the Filter Bank feature and the first likelihood probability matrix as the input data of the trained DNN-HMM model, and obtain the second likelihood probability matrix output by the trained DNN-HMM.
[0122] A posterior probability matrix output module 1210, configured to use the Filter Bank feature as the input feature of the trained LSTM model with connected units and obtain the posterior probability matrix output by the LSTM model with connected units, where the connected units are used to control the information flow between layers in the LSTM model.
[0123] The third output module 1212 is configured to use the posterior probability matrix and the second likelihood probability matrix as input data of the trained HMM model, and obtain an output third likelihood probability matrix.
[0124] The decoding module 1214 is configured to obtain the target word sequence corresponding to the speech data to be recognized in the phoneme decoding network according to the third likelihood probability matrix.
[0125] In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the following steps are implemented: acquiring the voice data to be recognized; extracting the Filter Bank feature and the MFCC feature from the voice data; using the MFCC feature as the input data of the trained GMM-HMM model, and obtaining the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as the input feature of the trained LSTM model with connected units, and obtaining the posterior probability matrix output by the LSTM model with connected units, where the connected units are used to control the flow of information between layers in the LSTM model; using the posterior probability matrix and the first likelihood probability matrix as the input data of the trained HMM model, and obtaining the output second likelihood probability matrix; and obtaining, according to the second likelihood probability matrix, the target word sequence corresponding to the speech data to be recognized in the phoneme decoding network.
[0126] In one embodiment, the connection unit is a sigmoid function, and the step executed by the processor of using the Filter Bank feature as the input feature of the trained LSTM model with connection units to obtain the posterior probability matrix output by the LSTM model with connection units, where the connection unit is used to control the information flow between layers in the LSTM model, includes: using the Filter Bank feature as the input feature of the trained LSTM model with connection units; determining, according to the state and output of the neuron node of the previous layer in the LSTM model and the input of the neuron node of the next layer, the sigmoid function value corresponding to the connection unit between the layers; and outputting, according to the sigmoid function value corresponding to the connection unit between the layers, the posterior probability matrix corresponding to the Filter Bank feature.

[0127] In one embodiment, the step executed by the processor of extracting the Filter Bank feature and the MFCC feature from the speech data includes: performing a Fourier transform to convert the speech data to be recognized into an energy spectrum in the frequency domain; using the energy spectrum in the frequency domain as the input feature of the Mel-scale triangular filter banks, and calculating the Filter Bank feature of the voice data to be recognized; and performing a discrete cosine transform on the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.

[0128] In one embodiment, the step executed by the processor of using the Filter Bank feature as the input feature of the trained LSTM model with connected units to obtain the posterior probability matrix output by the LSTM model with connected units, where the connection unit is used to control the flow of information between layers in the LSTM model, includes: obtaining the Filter Bank feature corresponding to each frame of speech data in the speech data to be recognized and sorting them by time; using each frame of speech data together with the Filter Bank features of the preset number of frames before and after that frame as the input features of the trained LSTM model with connection units, and controlling the information flow between layers through the connection units to obtain the output posterior probability on the phoneme states corresponding to each frame of speech data; and determining the posterior probability matrix corresponding to the speech data to be recognized according to the posterior probability corresponding to each frame of speech data.

[0129] In one embodiment, before the step of obtaining the speech data to be recognized, the processor executes the computer program to implement the following steps: using the training corpus to train the Gaussian mixture model GMM and the HMM, and determining the variance and mean corresponding to the GMM model through continuous iterative training; generating the trained GMM-HMM model according to the variance and mean; using the trained GMM-HMM model to obtain the likelihood probability matrix corresponding to the training corpus; training the LSTM model with connection units according to the Filter Bank features and the likelihood probability matrix extracted from the training corpus, and determining the weight matrices and bias matrices corresponding to the LSTM model with connection units; and generating the trained LSTM model with connected units according to the weight matrices and bias matrices.
[0130] In one embodiment, a computer-readable storage medium is provided, on which computer instructions are stored, and when the instructions are executed by a processor, the following steps are implemented: acquiring the speech data to be recognized; extracting the Filter Bank feature and the MFCC feature from the speech data; using the MFCC feature as the input data of the trained GMM-HMM model, and obtaining the first likelihood probability matrix output by the trained GMM-HMM model; using the Filter Bank feature as the input feature of the trained LSTM model with connection units, and obtaining the posterior probability matrix output by the LSTM model with connection units, where the connection unit is used to control the information flow between layers in the LSTM model; using the posterior probability matrix and the first likelihood probability matrix as the input data of the trained HMM model, and obtaining the output second likelihood probability matrix; and obtaining, according to the second likelihood probability matrix, the target word sequence corresponding to the speech data to be recognized in the phoneme decoding network.
[0131] In one embodiment, the connection unit is a sigmoid function, and the step executed by the processor of using the Filter Bank feature as the input feature of the trained LSTM model with connection units to obtain the posterior probability matrix output by the LSTM model with connection units, where the connection unit is used to control the information flow between layers in the LSTM model, includes: using the Filter Bank feature as the input feature of the trained LSTM model with connection units; determining, according to the state and output of the neuron node of the previous layer in the LSTM model and the input of the neuron node of the next layer, the sigmoid function value corresponding to the connection unit between the layers; and outputting, according to the sigmoid function value corresponding to the connection unit between the layers, the posterior probability matrix corresponding to the Filter Bank feature.

[0132] In one embodiment, the step executed by the processor of extracting the Filter Bank feature and the MFCC feature from the speech data includes: performing a Fourier transform to convert the speech data to be recognized into an energy spectrum in the frequency domain; using the energy spectrum in the frequency domain as the input feature of the Mel-scale triangular filter banks, and calculating the Filter Bank feature of the voice data to be recognized; and performing a discrete cosine transform on the Filter Bank feature to obtain the MFCC feature of the voice data to be recognized.

[0133] In one embodiment, the step executed by the processor of using the Filter Bank feature as the input feature of the trained LSTM model with connected units to obtain the posterior probability matrix output by the LSTM model with connected units, where the connection unit is used to control the flow of information between layers in the LSTM model, includes: obtaining the Filter Bank feature corresponding to each frame of speech data in the speech data to be recognized and sorting them by time; using each frame of speech data together with the Filter Bank features of the preset number of frames before and after that frame as the input features of the trained LSTM model with connection units, and controlling the information flow between layers through the connection units to obtain the output posterior probability on the phoneme states corresponding to each frame of speech data; and determining the posterior probability matrix corresponding to the speech data to be recognized according to the posterior probability corresponding to each frame of speech data.

[0134] In one embodiment, before the step of obtaining the speech data to be recognized, the processor executes the computer program to implement the following steps: using the training corpus to train the Gaussian mixture model GMM and the HMM, and determining the variance and mean corresponding to the GMM model through continuous iterative training; generating the trained GMM-HMM model according to the variance and mean; using the trained GMM-HMM model to obtain the likelihood probability matrix corresponding to the training corpus; training the LSTM model with connection units according to the Filter Bank features and the likelihood probability matrix extracted from the training corpus, and determining the weight matrices and bias matrices corresponding to the LSTM model with connection units; and generating the trained LSTM model with connected units according to the weight matrices and bias matrices.
[0135] Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed, it may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).
[0136] The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described; however, as long as the combination of these technical features is not contradictory, it should be considered within the scope of this specification.
[0137] The above-mentioned embodiments only express several implementation modes of the present invention, and the descriptions thereof are relatively specific and detailed, but should not be construed as limiting the patent scope of the invention. It should be pointed out that those skilled in the art can make several modifications and improvements without departing from the concept of the present invention, and these all belong to the protection scope of the present invention. Therefore, the protection scope of the patent for the present invention should be based on the appended claims.