Speech keyword recognition method and apparatus

By processing audio features with an LSTM neural network and decoding over a directed cyclic graph built from the keywords, the method addresses the non-reusability of keyword models in existing technologies, enabling flexible keyword recognition and efficient model reuse, and making it suitable for low-resource devices.

CN116524912B · Active · Publication Date: 2026-06-16 · SHANGHAI MOBVOI INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI MOBVOI INFORMATION TECH CO LTD
Filing Date
2023-03-17
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing speech keyword recognition methods cannot effectively reuse keyword models, which means that the model needs to be rebuilt when the keywords are changed, increasing costs and reducing recognition efficiency.

Method used

An LSTM neural network processes the audio features to produce a posterior probability matrix, and a directed cyclic graph is generated from the keywords. The optional paths in the graph are weighted by the posterior probability matrix to identify keywords, supporting flexible keyword replacement and model reuse.

Benefits of technology

It enables model reuse when keywords are changed, reduces construction costs, improves recognition efficiency and accuracy, and is suitable for deployment on low-resource devices.

✦ Generated by Eureka AI based on patent content.


Abstract

The disclosure provides a speech keyword recognition method and device. Features are first extracted from the audio signal to be recognized to obtain audio features, and the audio features are input into an LSTM module, which maps them to a high-dimensional feature space to obtain an output. A dimension transformation is then performed on the output of the LSTM module to obtain a posterior probability matrix. A directed cyclic graph of the keywords, i.e., a keyword decoding graph, is generated from the keywords and contains the optional paths of all keywords. Finally, the optional paths in the directed cyclic graph are weighted according to the posterior probability matrix to obtain path scores of the keywords. Thus, even if the keywords are replaced, the new keywords can be recognized by constructing a corresponding keyword graph and using it with the original acoustic model. This realizes model reuse, reduces the cost of model construction, improves the efficiency of keyword recognition, and gives relatively high recognition accuracy.

Description

Technical Field

[0001] This disclosure relates to the field of speech recognition technology, and in particular to a method and apparatus for speech keyword recognition.

Background Technology

[0002] Keyword Spotting (KWS) is an important technology for human-machine voice interaction. Its goal is to quickly and accurately retrieve specified keywords from a continuous audio stream. This technology is widely used for controlling various smart devices and waking them up.

[0003] Among existing KWS methods, one is the filler model approach, which treats KWS as a sequence labeling problem. It uses Hidden Markov Models (HMMs) or Neural Networks (NNs) to model keywords and uses additional category labels to match all non-keywords. Another is example-based KWS, which treats KWS as a matching problem and completes the retrieval task by calculating the similarity between keyword audio and non-keyword audio. Both methods can only model specific keywords; if the keywords change, the model cannot be reused.

Summary of the Invention

[0004] To address at least one of the aforementioned technical problems, this disclosure provides a method and apparatus for speech keyword recognition.

[0005] The first aspect of this disclosure proposes a speech keyword recognition method, comprising: extracting features from an audio signal to be recognized to obtain audio features; inputting the audio features into an LSTM module and mapping them to a high-dimensional feature space to obtain an output; performing a dimensionality transformation on the output of the LSTM module to obtain a posterior probability matrix; generating a keyword decoding graph of keywords, wherein the keyword decoding graph is a directed cyclic graph containing all possible paths of keywords; and weighting the possible paths in the directed cyclic graph according to the posterior probability matrix to obtain a path score for the keyword, thereby recognizing the keyword.

[0006] According to one embodiment of this disclosure, inputting the audio features into an LSTM module to obtain feature output includes: for each hidden node, obtaining the memory state of the current frame based on the audio features of the current frame; and obtaining the hidden information of the current frame based on the audio features of the current frame and the memory state of the current frame.

[0007] According to one embodiment of this disclosure, obtaining the memory state of the current frame based on the audio features of the current frame includes: obtaining forgotten information, updated information, and candidate information of the current frame based on the audio features of the current frame and the hidden information of the previous frame; obtaining retained information based on the forgotten information of the current frame and the memory state of the previous frame; and obtaining the memory state of the current frame based on the retained information, the updated information, and the candidate information of the current frame.

[0008] According to one embodiment of this disclosure, obtaining hidden information of the current frame based on the audio features of the current frame and the memory state of the current frame includes: obtaining the initial output of the current frame based on the audio features of the current frame and the hidden information of the previous frame; and obtaining the hidden information of the current frame based on the initial output of the current frame and the memory state of the current frame.

[0009] According to one embodiment of this disclosure, weighting the optional paths in the directed cyclic graph based on the posterior probability matrix includes: for the posterior probability vector of each audio frame, determining from the directed cyclic graph the sub-states corresponding to each current state in the beam (the set of active states), wherein a sub-state is a child node of the state in the directed cyclic graph; calculating the path score to each sub-state based on the posterior probability vector; selecting at least some of the sub-states based on the path score and updating them into the beam; pruning the sub-states in the beam based on a pruning threshold and the path score; and adding the initial state to the beam, using the sub-states in the beam as the new current states, and weighting again, until the audio signal to be identified has been fully processed; wherein, if a current state in the beam satisfies a preset termination condition, the exp value of the path score of that current state is determined, and the keyword is identified when the exp value is greater than a preset confidence threshold.

[0010] According to one embodiment of this disclosure, the path score of the sub-state is calculated using the following formula:

[0011]

[0012] Where cur_frame is the time frame recorded when the current state is reached, start_frame is the time frame at which the keyword begins, and $y_t$ is the posterior probability of the label corresponding to the state at time t.

[0013] According to one embodiment of this disclosure, selecting at least some of the sub-states based on the path score and updating them into the beam includes: if multiple sub-states have the same label, selecting the sub-state with the highest path score to update the beam.

[0014] According to one embodiment of this disclosure, determining the pruning threshold includes: determining the path score of each current state in the beam at each time step, determining the highest score among all the path scores of the current states, and using the difference between the highest score and a preset threshold as the pruning threshold.

[0015] According to one embodiment of this disclosure, the path score of a state at the next time step is the path score of the corresponding sub-state at the current time step.
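The selection and pruning rules in [0013] and [0014] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `update_beam`, the `(label, state, score)` tuple layout, the `beam_width` argument (standing in for the preset threshold), the example state numbers, and the choice to keep scores equal to the threshold are all assumptions.

```python
def update_beam(candidates, beam_width):
    """Keep the best candidate per label, then prune against the best score.

    candidates is a list of (label, state, score) tuples. beam_width plays
    the role of the preset threshold: the pruning threshold is the highest
    score minus beam_width (keeping ties is an assumption).
    """
    best_per_label = {}
    for label, state, score in candidates:
        kept = best_per_label.get(label)
        if kept is None or score > kept[2]:
            best_per_label[label] = (label, state, score)
    if not best_per_label:
        return []
    threshold = max(s for _, _, s in best_per_label.values()) - beam_width
    return [c for c in best_per_label.values() if c[2] >= threshold]

# Illustrative candidates with made-up state numbers and scores.
candidates = [("w", 11, -0.2), ("w", 27, -0.9), ("o", 13, -1.0), ("k", 31, -6.5)]
beam = update_beam(candidates, beam_width=5.0)
```

Here the two candidates labelled "w" collapse to the higher-scoring one, and the "k" candidate falls below the pruning threshold of -5.2 and is dropped.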

[0016] A second aspect of this disclosure provides a speech keyword recognition device, comprising: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory, causing the processor to perform the speech keyword recognition method described in any of the above embodiments.

Attached Figure Description

[0017] The accompanying drawings illustrate exemplary embodiments of the present disclosure and, together with the description thereof, serve to explain the principles of the present disclosure. These drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this specification.

[0018] Figure 1 is a flowchart illustrating a speech keyword recognition method according to one embodiment of the present disclosure.

[0019] Figure 2 is a schematic diagram of a framing process according to one embodiment of the present disclosure.

[0020] Figure 3 is a schematic diagram of a network framework for speech keyword recognition according to one embodiment of the present disclosure.

[0021] Figure 4 is a schematic diagram illustrating the computation process of any LSTM node in an LSTM neural network according to one embodiment of the present disclosure.

[0022] Figure 5 is a schematic diagram illustrating the generation of a training dataset according to one embodiment of the present disclosure.

[0023] Figure 6 is a schematic diagram of a keyword-based directed cyclic graph according to one embodiment of this disclosure.

[0024] Figure 7 is a schematic diagram of a speech keyword recognition device employing a hardware implementation of a processing system according to one embodiment of the present disclosure.

Detailed Implementation

[0025] The present disclosure will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of the disclosure. Furthermore, it should be noted that, for ease of description, only the parts relevant to the present disclosure are shown in the accompanying drawings.

[0026] It should be noted that, where there is no conflict, the embodiments and features described in this disclosure can be combined with each other. The technical solutions of this disclosure will now be described in detail with reference to the accompanying drawings and embodiments.

[0027] Unless otherwise stated, the exemplary implementations / embodiments shown are to be understood as providing exemplary features of various details that provide ways in which the technical concepts of this disclosure can be implemented in practice. Therefore, unless otherwise stated, the features of various implementations / embodiments may be additionally combined, separated, interchanged and / or rearranged without departing from the technical concepts of this disclosure.

[0028] The terminology used herein is for the purpose of describing particular embodiments and is not restrictive. As used herein, unless the context clearly indicates otherwise, the singular forms “a” and “the” are intended to include the plural forms as well. Furthermore, when the terms “comprising” and / or “including” and variations thereof are used in this specification, they indicate the presence of the stated features, integers, steps, operations, parts, components, and / or groups thereof, but do not exclude the presence or addition of one or more other features, integers, steps, operations, parts, components, and / or groups thereof. It should also be noted that, as used herein, the terms “substantially,” “about,” and other similar terms are used as terms of approximation rather than as terms of degree, and thus account for the inherent deviations in measured, calculated, and / or provided values that would be recognized by one of ordinary skill in the art.

[0029] The speech keyword recognition method and apparatus of this disclosure are described below with reference to the accompanying drawings.

[0030] Figure 1 is a schematic flowchart of a speech keyword recognition method according to one embodiment of the present disclosure. Referring to Figure 1, the speech keyword recognition method M10 of this embodiment may include the following steps.

[0031] S100: Extract audio features from the audio signal to be identified.

[0032] The purpose of extracting audio features is to convert the audio signal to be recognized into a form that can be input into a neural network. The extracted audio features can be Fbank (filter bank) features, thereby enhancing the performance of speech recognition.

[0033] Step S100 may include: sequentially performing frame segmentation, windowing, fast Fourier transform, and Mel filtering on the audio signal to be identified to obtain audio features.

[0034] Framing refers to dividing an audio signal into multiple audio frames. Two adjacent audio frames can overlap. Figure 2 is a schematic diagram of a framing process according to one embodiment of this disclosure. Referring to Figure 2, when the audio signal AS to be identified is divided into frames, the frame length $F_L$ of an audio frame F can be 25 ms and the frame offset $F_d$ can be 10 ms, allowing for a smooth transition between audio frames. Framing the audio signal AS yields its time-domain signal A, which is arranged in the following form:

[0035] $$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_{m_1} \\ a_{1+m_2} & a_{2+m_2} & \cdots & a_{m_1+m_2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1+(n-1)m_2} & a_{2+(n-1)m_2} & \cdots & a_{m_1+(n-1)m_2} \end{bmatrix}$$

[0036] Where n is the total number of frames, and each row of the time-domain signal A corresponds to one frame of the time-domain signal. $m_1$ is the number of sampling points contained in one frame. $m_2$ is the number of sampling points contained in the frame offset. a is the signal component, equivalent to the sampling points after discrete sampling of the continuous audio signal. If the audio sampling rate is 16 kHz, then the frame length $F_L$ contains 400 sampling points and the frame offset $F_d$ contains 160 sampling points, i.e., $m_1 = 400$ and $m_2 = 160$.
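The framing step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name `frame_signal` is hypothetical, and dropping trailing samples that do not fill a whole frame is an assumption, since the text does not say how the end of the signal is handled.

```python
import numpy as np

def frame_signal(signal, m1=400, m2=160):
    """Split a 1-D signal into overlapping frames.

    m1 is the frame length in samples and m2 the frame offset in samples
    (400 and 160 correspond to 25 ms and 10 ms at 16 kHz).
    Trailing samples that do not fill a whole frame are dropped; this
    edge handling is an assumption, not something the text specifies.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < m1:
        return np.empty((0, m1))
    n = 1 + (len(signal) - m1) // m2            # total number of frames
    starts = m2 * np.arange(n)[:, None]         # first sample index of each row
    offsets = np.arange(m1)[None, :]            # position inside a frame
    return signal[starts + offsets]             # shape (n, m1)

one_second = np.arange(16000, dtype=np.float64)  # 1 s of dummy audio at 16 kHz
A = frame_signal(one_second)
```

With one second of 16 kHz audio this yields 98 frames of 400 samples each, and each row starts 160 samples after the previous one.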

[0037] It will be understood that pre-emphasis can be applied before the audio signal to be recognized is framed. Pre-emphasis boosts the high-frequency components of the audio signal, for example with a first-order high-pass filter, to improve the signal-to-noise ratio in the high-frequency range and flatten the signal's spectrum. The pre-emphasized signal is then framed.
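A first-order high-pass pre-emphasis filter can be sketched as below. The coefficient 0.97 is a commonly used value and is an assumption here, as is the function name `pre_emphasis`; the text does not give a coefficient.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[0] = x[0], y[t] = x[t] - alpha * x[t-1].

    alpha = 0.97 is an assumed, commonly used coefficient.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size == 0:
        return signal
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

constant = np.ones(5)                 # a constant (zero-frequency) signal
emphasized = pre_emphasis(constant)   # is attenuated after the first sample
```

A constant input is almost entirely suppressed after the first sample, which illustrates how the filter reduces low-frequency content relative to high-frequency content.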

[0038] Windowing refers to substituting each audio frame into a window function, or multiplying by the window function, to increase the continuity at both ends of the audio frame. The window function can be a Hamming window or a Hanning window. The window function for a Hamming window is:

[0039] Where t is the index in each frame of the signal, which is the time step of the discrete speech signal, L is the total number of samples in each frame of the audio signal, and W(t) is the window coefficient.
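For reference, the Hamming window is conventionally written as follows, with t running from 0 to L-1. The constants 0.54 and 0.46 are the standard textbook values and are an assumption here, since the patent's own expression may differ:

```latex
W(t) = 0.54 - 0.46\cos\left(\frac{2\pi t}{L-1}\right), \qquad 0 \le t \le L-1
```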

[0040] After the audio signal is segmented into frames, the time-domain signal of the i-th audio frame after applying a Hamming window is:

[0041] $$A'_i = A_i \odot W$$

[0042] Where i∈[1,n], $A_i$ is the time-domain signal of the i-th audio frame before windowing, W is the vector of window coefficients applied to the i-th frame, and $\odot$ is the Hadamard product operator, which represents the multiplication of corresponding elements.

[0043] After framing and windowing, the i-th frame audio signal $A'_i$, where i∈[1,n], is converted to the frequency domain by a Fourier transform. Specifically, a fast Fourier transform (FFT) is applied to each frame signal to obtain the spectrum of that frame. The power spectrum of the audio signal is then obtained by taking the squared modulus of the spectrum, which converts the frame into an energy distribution over frequency. Different energy distributions represent different characteristics of the audio signal.
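Windowing followed by the FFT and power spectrum can be sketched as below. The FFT size of 512 (the next power of two above 400, with zero-padding) and the function name `power_spectrum` are assumptions; `numpy.hamming` supplies the standard Hamming coefficients.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Window each frame, take the FFT, and return |FFT|^2 per frame.

    n_fft = 512 is an assumed FFT size; frames shorter than n_fft are
    zero-padded by numpy.fft.rfft.
    """
    frames = np.asarray(frames, dtype=np.float64)
    window = np.hamming(frames.shape[1])        # window coefficients W
    windowed = frames * window                  # element-wise (Hadamard) product
    spectrum = np.fft.rfft(windowed, n=n_fft, axis=1)
    return np.abs(spectrum) ** 2                # shape (n, n_fft // 2 + 1)

rng = np.random.default_rng(0)
frames = rng.standard_normal((98, 400))         # dummy frames, 400 samples each
P = power_spectrum(frames)
```

Because the input is real, only the non-negative frequencies are kept, giving 257 bins per frame for a 512-point FFT.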

[0044] Mel filtering refers to extracting frequency bands from the power spectrum using a Mel-scaled triangular filter bank, which contains multiple triangular filters. This smooths the spectrum, eliminates harmonics, and highlights the formants (resonance peaks) of the audio signal. Taking the logarithm of the extracted frequency bands then yields a G-dimensional Fbank audio feature, where G is the dimension of the obtained audio feature. If a 23-dimensional Fbank feature is obtained, then G = 23.
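A Mel-scaled triangular filter bank and the log step can be sketched as below. The Hz-to-mel formula (2595·log10(1 + f/700)), the 512-point FFT, the natural logarithm, the small floor before the log, and all function names are assumptions chosen for illustration; the text fixes only the triangular shape, the Mel spacing, and G = 23.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=23, n_fft=512, sample_rate=16000):
    """Triangular filters spaced evenly on the mel scale, shape (G, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)                       # filter edge frequencies
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, bin_freqs.size))
    for g in range(n_filters):
        left, center, right = hz_points[g], hz_points[g + 1], hz_points[g + 2]
        rising = (bin_freqs - left) / (center - left)       # slope up to the peak
        falling = (right - bin_freqs) / (right - center)    # slope down from the peak
        bank[g] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_fbank(power, bank, floor=1e-10):
    """Apply the filter bank to a power spectrum and take the logarithm."""
    return np.log(np.maximum(power @ bank.T, floor))

bank = mel_filterbank()
power = np.ones((5, 257))            # dummy flat power spectrum for 5 frames
features = log_fbank(power, bank)    # shape (5, 23)
```

Each row of `features` is one 23-dimensional Fbank vector, which is the per-frame input described for the LSTM module.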

[0045] S200: Input the audio features into the LSTM module and map them to a high-dimensional feature space to obtain an output.

[0046] LSTM (Long Short-Term Memory) neural networks are a type of recurrent neural network used for speech recognition. LSTM neural networks can retain historical information about the input signal, making them suitable for processing signals that are correlated over time. An LSTM neural network can include LSTM modules, which consist of multiple hidden layers, each with multiple hidden nodes.

[0047] Figure 3 is a schematic diagram of a network framework for speech keyword recognition according to one embodiment of this disclosure. Referring to Figure 3, the LSTM neural network consists of an LSTM module and a CTC module. The LSTM module can include three hidden layers: h1, h2, and h3, each containing 64 hidden nodes. The h1 hidden layer has 23*64*4*2 weight parameters, the h2 hidden layer has 64*64*4*2 weight parameters, and the h3 hidden layer has 64*64*4*2 weight parameters. The input to the LSTM module is the audio features of each frame, and its output is the output of the last hidden layer, which is a 64-dimensional vector.

[0048] During the processing of the input feature vector by the LSTM neural network, the feature vector of the first frame of audio is input into the LSTM module to obtain the corresponding module output. This output is used both as input to the CTC loss function and as part of the input for the next audio frame, thus passing on historical information. From the second frame onwards until the last frame, the feature vector of each frame, along with the LSTM module output of the previous frame, is used as input to obtain the module output for the current frame.

[0049] The parameter table for the LSTM neural network is shown in Table 1 below.

[0050] Table 1 Parameter Table of LSTM Neural Network

[0051]

[0052] In Table 1, Normalize represents normalization, LSTM_1, LSTM_2 and LSTM_3 represent three LSTM modules, weight is the weight, bias is the bias parameter, input is the feature dimension of the input (e.g., 23 dimensions), batchsize is the number of input samples in a single training iteration, T represents the maximum time step, Linear1 is the linear layer, and CTC (Connectionist Temporal Classification) is the connectionist temporal classification layer.

[0053] Figure 4 is a schematic diagram illustrating the computation process of any LSTM node in an LSTM neural network according to one embodiment of the present disclosure. Referring to Figure 4, assume the audio feature of the t-th frame is $x_t$, where $x_t = (x_1, x_2, \ldots, x_G)$. The audio feature $x_t$ can first be normalized by a normalization module to obtain the normalized audio feature, which is then input into the LSTM module to obtain the output u.

[0054]

[0055] The output u is the content passed from the LSTM module to the CTC module in Figure 3, i.e., the aforementioned 64-dimensional vector.

[0056] The process of operating on hidden nodes in LSTM can include the following steps.

[0057] S210: For each hidden node, obtain the memory state $C_t$ of the current frame based on the audio features of the current frame.

[0058] S220: Obtain the hidden information $h_t$ of the current frame based on the audio features of the current frame and the memory state $C_t$ of the current frame.

[0059] For nodes other than the first hidden node, the input to the hidden node includes the normalized audio features of the current frame, the hidden information $h_{t-1}$ of the previous frame, and the memory state $C_{t-1}$ of the previous frame. The output is the memory state $C_t$ and the hidden information $h_t$ of the current frame. The memory state $C_t$ and hidden information $h_t$ are input into the next hidden node to update the memory state and compute the hidden information, until the computation at each hidden node of the LSTM module is complete and the LSTM module outputs the features of the t-th audio frame. In Figure 4, along the time axis, the hidden information $h_{t-1}$ and memory state $C_{t-1}$ at time t-1 are passed on to time t.

[0060] Specifically, S210 may include the following steps.

[0061] S211: Based on the audio features $x_t$ of the current frame and the hidden information $h_{t-1}$ of the previous frame, obtain the forgetting information $f_t$, the update information $i_t$, and the candidate information $\tilde{C}_t$ of the current frame.

[0062] The forgetting information $f_t$, update information $i_t$, and candidate information $\tilde{C}_t$ of the current frame can be calculated as:

[0063] $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

[0064] $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

[0065] $$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

[0066] Where σ represents the sigmoid activation function, whose value range is (0,1), so that the value range of $f_t$ is (0,1); tanh represents the tanh activation function, whose value range is (-1,1). $W_f$, $W_i$, and $W_C$ are, in order, the weight parameters of the forget gate, the update gate, and the memory unit. $b_f$, $b_i$, and $b_C$ are, in order, the bias parameters of the forget gate, the update gate, and the memory unit.

[0067] S212: Based on the forgetting information $f_t$ of the current frame and the memory state $C_{t-1}$ of the previous frame, obtain the retained information. Specifically, $f_t$ is multiplied by $C_{t-1}$ to yield the retained information $f_t \otimes C_{t-1}$, where $\otimes$ represents element-wise multiplication of matrices.

[0068] S213: Based on the retained information, the update information $i_t$, and the candidate information $\tilde{C}_t$ of the current frame, obtain the memory state $C_t$ of the current frame.

[0069] The memory state $C_t$ of the current frame can be calculated as:

[0070] $$C_t = f_t \otimes C_{t-1} \oplus i_t \otimes \tilde{C}_t$$

[0071] $f_t$ selectively filters the memory state $C_{t-1}$ of the previous frame through the forget gate to obtain the retained information, i.e., the product of $f_t$ and $C_{t-1}$. $i_t$ and $\tilde{C}_t$ determine, through the update gate, the value to be updated, i.e., the product of $i_t$ and $\tilde{C}_t$. Adding the retained information to the value to be updated yields the memory state $C_t$, where $\oplus$ represents the addition of matrices. $C_t$ represents the memory state of the current frame t. In this way, unnecessary information is filtered out and new information is added.

[0072] Specifically, S220 may include the following steps.

[0073] S221: Based on the audio features $x_t$ of the current frame and the hidden information $h_{t-1}$ of the previous frame, obtain the initial output $o_t$ of the current frame.

[0074] The initial output $o_t$ of the current frame can be calculated as:

[0075] $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

[0076] Where σ represents the sigmoid activation function, whose value range is (0,1), so that the value range of $o_t$ is (0,1). $W_o$ represents the weight parameters of the output gate, and $b_o$ represents the bias parameters of the output gate. It will be understood that the parameter size of the LSTM module in Figure 4 is the total size of its parameters, for example, the sum of the dimensions of the weights $W_f$, $W_i$, $W_C$, and $W_o$.

[0077] S222: Based on the initial output $o_t$ of the current frame and the memory state $C_t$ of the current frame, obtain the hidden information $h_t$ of the current frame.

[0078] The hidden information $h_t$ of the current frame can be calculated as:

[0079] $$h_t = o_t * \tanh(C_t)$$

[0080] The tanh function scales the memory state $C_t$ of the current frame to the numerical range (-1, 1), and the result is multiplied element-wise by $o_t$ to yield the output of the hidden node. This, in turn, provides the feature output of the LSTM module.
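Steps S211 to S222 for a single layer and a single frame can be sketched in NumPy as below. This is an illustration only: the function names, the `params` dictionary layout, the random weights, and the use of one layer instead of the three stacked layers are assumptions, and input normalization is omitted. The sizes G = 23 and 64 hidden nodes come from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single LSTM layer.

    params maps "W_f", "W_i", "W_C", "W_o" to arrays of shape
    (hidden, hidden + input) and "b_f", "b_i", "b_C", "b_o" to arrays
    of shape (hidden,). This layout is an assumption for illustration.
    """
    z = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])        # forgetting information
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])        # update information
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])    # candidate information
    c_t = f_t * c_prev + i_t * c_tilde                      # memory state C_t
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])        # initial output
    h_t = o_t * np.tanh(c_t)                                # hidden information h_t
    return h_t, c_t

rng = np.random.default_rng(0)
G, H = 23, 64                                               # feature and hidden sizes
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.standard_normal((H, H + G)) * 0.1   # random, untrained
    params[f"b_{gate}"] = np.zeros(H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.standard_normal((10, G)):                    # ten dummy frames
    h, c = lstm_step(x_t, h, c, params)
```

The loop carries `h` and `c` from one frame to the next, which is how the hidden information and memory state at time t-1 reach time t.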

[0081] Figure 5 is a schematic diagram illustrating the generation of a training dataset according to one embodiment of this disclosure. Referring to Figure 5, in one embodiment, the training method of the above-mentioned LSTM neural network may include: obtaining a training dataset by adding reverberation and / or noise to the original dataset, and training the LSTM neural network based on the training dataset.

[0082] The results of speech recognition directly affect the performance of keyword retrieval. To improve the robustness of speech recognition and adapt to more scenarios, reverberation and / or noise can be added to the original dataset. For this purpose, three datasets can be prepared in advance: the original speech dataset for speech recognition, the noise dataset for adding noise to the audio, and the reverberation dataset for adding reverberation to the audio.

[0083] The original speech dataset DataSet0 can include spoken audio with a sampling rate of 16 kHz and corresponding audio labels. For example, the labels can consist of 26 English characters, a <blank> character, and three placeholders, the three placeholders being "_", "<eos>", and "<sos>". The audio corpus can comprise 3,000 hours of speech data, including recordings from various age groups, genders, and devices.

[0084] The noise dataset DataSet1 can be the open-source Musan noise dataset, which can contain various noises with a sampling rate of 16 kHz, such as vehicle noise and music noise, with a total noise duration of approximately 40 hours. Noise can be added to the original dataset by randomly selecting a noise signal noise(t) from the noise dataset and a speech signal b(t) from the original dataset, and generating a noisy speech signal b'(t) according to the following formula:

[0085] b′(t)=SNR*noise(t)+b(t)

[0086] Where t is the time index over the length of the audio signal, and SNR is the signal-to-noise ratio. The value of SNR can be randomly selected from 0, 10, or 20.
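The formula above uses SNR directly as a multiplier on the noise. A common alternative, which is not stated in the patent and is shown here only as an assumption, treats 0, 10, and 20 as target signal-to-noise ratios in dB and scales the noise so that the mixture reaches that ratio. The function name `mix_at_snr` and the tiling of short noise clips are hypothetical choices.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db (dB).

    This dB-based scaling is an assumed variant, not the patent's formula.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    noise = np.resize(noise, speech.shape)       # repeat or truncate to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
b = rng.standard_normal(16000)                   # dummy speech signal b(t)
noise = rng.standard_normal(8000)                # dummy noise clip, shorter than b
snr_db = rng.choice([0, 10, 20])                 # random SNR, as in the text
b_noisy = mix_at_snr(b, noise, snr_db)
```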

[0087] The reverberation dataset DataSet2 can utilize the open-source RIR (Room Impulse Response) reverberation dataset to simulate reverberation scenarios in various rooms. The reverberation dataset can contain reverberation data from multiple different rooms with a sampling rate of 16kHz, totaling approximately 50 hours of reverberation time. Adding reverberation to the original dataset can be achieved by randomly selecting a reverberation signal r(t) from the dataset and a speech signal c(t) from the original dataset, and generating a reverberated speech signal c'(t) according to the following formula:

[0088] c′(t)=r(t)*c(t)
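Assuming that * in the formula above denotes convolution, as is usual when applying a room impulse response, adding reverberation can be sketched as below. The function name `add_reverb`, the toy impulse response, and truncating the result to the original length are assumptions.

```python
import numpy as np

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response and keep the original length."""
    speech = np.asarray(speech, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)
    return np.convolve(speech, rir)[: len(speech)]

c = np.array([1.0, 0.0, 0.0, 0.0])   # a single click as the "speech" signal c(t)
r = np.array([1.0, 0.5, 0.25])       # toy impulse response: direct sound plus two echoes
c_reverb = add_reverb(c, r)
```

Convolving a single click with the impulse response reproduces the response itself, so the output shows the direct sound followed by two decaying echoes.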

[0089] If both reverberation and noise are added to the original dataset, the reverberation can be added first followed by noise, or vice versa. If reverberation is added first and noise second, the speech signal b(t) used in the noise step is the reverberated speech signal c′(t). If noise is added first and reverberation second, the speech signal c(t) used in the reverberation step is the noisy speech signal b′(t). Thus, the training dataset DataSet3 is obtained after data augmentation through reverberation and / or noise addition.

[0090] For each speech data item in the training dataset DataSet3, features X are extracted from the speech data. X can be an n*23 matrix, where n is the number of frames. The batch size can be set to 1024, giving training data of size 1024*n*23 for a single training iteration. The training data is then input into the LSTM neural network for training.

[0091] During training, the CTC loss function is used, and the maximum likelihood criterion can be employed. The loss function can be specifically expressed by the following formula:

[0092]

[0093] Where S is the space of the training dataset DataSet3, seq is the input of the LSTM module, i.e., the Fbank audio features, z is the decoding path, and p(z|seq) is the posterior probability of seq corresponding to z.
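For reference, the standard maximum-likelihood form of the CTC objective, written with the symbols defined above, is given below. It is the usual textbook expression and is an assumption here, since the patent's own formula may differ in normalization or notation:

```latex
L_{CTC} = -\sum_{(seq,\, z) \in S} \ln p(z \mid seq)
```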

[0094] The optimizer used during training can be AdaDelta, configured with the parameters rho, epsilon, and lr. The rho parameter is the decay rate of the moving average of squared gradients and can be set to 0.95. The epsilon parameter is a small constant added for numerical stability and can be set to 1.0e-8. The lr parameter is the learning rate and can be set to 1.0.
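The AdaDelta update with these three parameters can be sketched in NumPy as below. The function name `adadelta_step`, the `state` dictionary, and the toy quadratic loss are assumptions for illustration; a real training run would use a deep learning framework's optimizer instead.

```python
import numpy as np

def adadelta_step(param, grad, state, rho=0.95, epsilon=1.0e-8, lr=1.0):
    """One AdaDelta update; state holds the two running averages."""
    # Moving average of squared gradients.
    state["sq_grad"] = rho * state["sq_grad"] + (1.0 - rho) * grad ** 2
    # Step scaled by the ratio of the two root-mean-square estimates.
    delta = -np.sqrt(state["sq_delta"] + epsilon) / np.sqrt(state["sq_grad"] + epsilon) * grad
    # Moving average of squared updates.
    state["sq_delta"] = rho * state["sq_delta"] + (1.0 - rho) * delta ** 2
    return param + lr * delta

w = np.array([5.0])
state = {"sq_grad": np.zeros(1), "sq_delta": np.zeros(1)}
start_loss = float(w[0] ** 2)
for _ in range(200):
    grad = 2.0 * w                       # gradient of the toy loss w^2
    w = adadelta_step(w, grad, state)
end_loss = float(w[0] ** 2)
```

The loop only demonstrates that the toy loss decreases; with epsilon as small as 1.0e-8 the first steps are tiny, so it is not meant to show convergence.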

[0095] S300: Perform a dimensionality transformation on the output of the LSTM module to obtain the posterior probability matrix.

[0096] Referring to Figure 3 and Figure 4, the 64-dimensional feature vector output by the LSTM module can be transformed using the CTC (Connectionist Temporal Classification) module. The CTC module is configured with a linear layer Linear2 and a corresponding softmax activation function. The Linear2 layer can have 64*30 parameters (corresponding to the parameter size of the CTC layer in Table 1), and its output dimension is the number of labels (e.g., 30 labels).

[0097] It will be understood that after the feature output of the LSTM module is obtained, it can be normalized using the softmax function, and the value range of the normalized feature output is (0,1).
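The linear projection from 64 to 30 dimensions followed by softmax can be sketched as below. The random weights, the absence of a bias term (consistent with the stated 64*30 parameter count, but still an assumption), and the variable names are illustrative only.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, hidden, num_labels = 500, 64, 30
W = rng.standard_normal((hidden, num_labels)) * 0.1   # Linear2 weights, 64*30 (random here)
u = rng.standard_normal((T, hidden))                  # dummy LSTM outputs, one row per frame
posteriors = softmax(u @ W)                           # shape (T, num_labels)
```

Each row of `posteriors` is one frame's probability distribution over the 30 labels, so every entry lies in (0,1) and each row sums to 1.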

[0098] $x_t$ is the input to the LSTM neural network, for example the 23-dimensional Fbank feature vector of frame t. The output of the LSTM neural network is a 30-dimensional posterior probability vector for frame t, where the 30 dimensions correspond to 30 target labels. Target labels can be all English letters, the <blank> character, placeholders, and so on. This realizes the transformation from a 64-dimensional vector to a 30-dimensional vector. The posterior probability matrix Y is:

[0099] $$Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1T} \\ y_{21} & y_{22} & \cdots & y_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ y_{D1} & y_{D2} & \cdots & y_{DT} \end{bmatrix}$$

[0100] Where D represents the number of label categories, and T represents the maximum time value, which is the number of frames in the audio. The element $y_{DT}$ in the D-th row and T-th column of the posterior probability matrix Y represents the posterior probability that the label at time T is the D-th label.

[0101] Assuming that 500 frames of audio features are obtained through step S100, the LSTM neural network will process them frame by frame. After processing the feature vector of the (t-1)th frame, it will output the result of the (t-1)th frame, and then input the feature vector of the tth frame and start processing the tth frame, until all frames are processed, resulting in a posterior probability matrix of size 500*30.

[0102] S400 generates a keyword decoding graph, which is a directed cyclic graph. The directed cyclic graph contains all possible paths for each keyword.

[0103] A directed graph with cycles is a directed graph that contains cycles. Figure 6 This is a schematic diagram of a directed cyclic graph based on keywords of one embodiment of this disclosure. See also... Figure 6 Based on the decoding characteristics of CTC (Connection-Temporal Classification), a keyword decoding graph G is constructed using states as nodes for the keyword set. The initial state is 0, and the final state of each keyword is its position in the keyword set. ε represents whitespace characters in the labels. <blank>.

[0104] Each node in the graph represents a state number. A transition from the current state to the next state passes one keyword label. For example, a transition from state 0 to state 1 passes one label, while a path from the start state to an end state corresponds to a keyword path. Assume there is a keyword set K containing q predefined keywords $k_1, k_2, \ldots, k_q$. Each keyword is composed of characters from the modeling units; for example, the first keyword can consist of j1 characters and the second keyword of j2 characters. The character composition of the other keywords follows the same pattern.

[0105] For any node, the optional paths can include loop paths, i.e., paths from the current node directly back to itself without passing through any other node. For keyword $k_1$, the transitions from the initial state 0 to the final state 1 cover all optional paths of $k_1$. For example, if the current state is 0 and the character leading to state 11 is ..., the corresponding optional paths include 0-11, 0-10-11, 0-11-11, 0-10-10-11, 0-11-11-11, and so on. The optional paths for other keywords follow the same pattern.

[0106] The LSTM neural network outputs a classification result for each frame, and any frame may be recognized as the blank label. For example, if the label sequence corresponding to a 10-frame audio segment is wo, the per-frame outputs (the label with the maximum value in the LSTM posterior vector of each frame) may be εwwεεεooεε. These 10 results can be decoded as wo, and such blank transitions are allowed through states 10, 12, 14, etc.
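The decoding of the 10-frame example can be illustrated with the standard CTC collapse rule: merge consecutive repeated labels, then drop blanks. The sketch below is only an illustration of that rule; the function name `ctc_collapse` is an assumption, and the decoding graph itself is not modelled here.

```python
from itertools import groupby

BLANK = "ε"  # the blank label, written as ε in the text


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blank labels."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


decoded = ctc_collapse("εwwεεεooεε")
```

Here `decoded` is `"wo"`, matching the label sequence in the example.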

[0107] If the first few characters of two keywords are the same (i.e., they share a prefix), those characters can be merged. Specifically, starting from the first character of each keyword, determine the longest common string of the first and second keywords. Then take the state number of the first character after the longest common string in the second keyword as a next state of the last character of the longest common string in the first keyword. For example, if the first three characters of keywords $k_1$ and $k_2$ are the same, the longest common string consists of those three characters, and state number 27 of the 4th character of $k_2$ is used as a next state of state number 15, the state of the third character of $k_1$.
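The prefix merging can be sketched as a small trie-like state graph. This is a simplified illustration under stated assumptions: it omits the blank states and loop paths of the full decoding graph, it numbers states in creation order rather than with the numbering of Figure 6, and the keywords `hello` and `help` and the function name `build_keyword_graph` are hypothetical.

```python
def build_keyword_graph(keywords):
    """Build a prefix-sharing graph: edges[state][char] -> next state."""
    edges = {0: {}}   # state 0 is the initial state
    finals = {}       # final state -> keyword completed at that state
    next_id = 1
    for keyword in keywords:
        state = 0
        for char in keyword:
            if char not in edges[state]:
                # No keyword has needed this transition yet: add a state.
                edges[state][char] = next_id
                edges[next_id] = {}
                next_id += 1
            # A shared prefix simply follows the existing transition.
            state = edges[state][char]
        finals[state] = keyword
    return edges, finals


edges, finals = build_keyword_graph(["hello", "help"])
```

Both keywords share the states created for `h`, `e`, and `l`; the state of the third character then has two next states, one per keyword, which mirrors the merging rule described above.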

[0108] S500: Weight the optional paths in the directed cyclic graph according to the posterior probability matrix to obtain the path scores of the keywords, thereby identifying the keywords.

[0109] After the keyword search graph is constructed, it is weighted according to the posterior probability matrix Y. After the posterior vector of each frame is computed, a score is calculated for each state in the graph and passed along to that state; the score serves as the weight of the corresponding edge, thereby weighting the optional paths and yielding the optimal path (and its path score) in the search graph.

[0110] S500 may specifically include the following steps.

[0111] S510: For each audio frame's posterior probability vector, determine from the directed cyclic graph the sub-state corresponding to each current state in the beam, where a sub-state is a child node of that state in the directed cyclic graph.

[0112] For each frame of audio, the LSTM model yields a posterior vector $y_t$, t∈[1,T], where T is the number of frames in the audio, i.e., the T in the 30*T dimensions of the posterior matrix. In the 30-dimensional posterior vector, each dimension represents the probability of one label. Traverse all states in the beam, denoting each as beam_state, and find its sub-states in the constructed keyword graph G, denoted as next_state. Here, a sub-state refers to a child node of a state in the keyword decoding graph G.

[0113] S520: Calculate the path score to the current sub-state based on the posterior probability vector.

[0114] During the weighting process, the keyword graph can be pruned simultaneously using the Beam Search algorithm, thereby controlling the computational power and memory required for deployment. Beam Search is a heuristic graph search algorithm typically used when the solution space of the graph is large. To reduce the space and time consumed by the search, some low-quality nodes are pruned at each depth expansion step, retaining some high-quality nodes. This reduces space consumption and improves time efficiency.

[0115] Assume that the label passed from the beam state beam_state to the child state next_state is token, where token is the index of the label and token∈[1,30]. Let $y_{token}$ denote the token-th dimension of the posterior vector, and let num_frames denote the number of frames from the initial state to the current state. The path score to the current state (current frame) can be calculated from the posterior probability vector using the following formula:

[0116]

[0117] Where cur_frame is the time frame at which the current state is reached, start_frame is the time frame at which the keyword begins, and $y_t$ is the posterior probability of the label corresponding to the state at time t, obtained from the posterior probability matrix.
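To make these quantities concrete, the sketch below assumes one form of path score that is consistent with the definitions of num_frames, start_frame, cur_frame, and $y_t$, and with the later use of the exp value of the score: the mean of the log posteriors along the path. This form is an assumption for illustration only and is not presented as the formula of the disclosure; the name `path_score` and the sample posteriors are likewise hypothetical.

```python
import math


def path_score(posteriors, start_frame, cur_frame):
    """Assumed form: mean log posterior over frames start_frame..cur_frame.

    posteriors[t] is y_t, the posterior of the label matched at frame t.
    """
    num_frames = cur_frame - start_frame + 1
    log_sum = sum(
        math.log(posteriors[t]) for t in range(start_frame, cur_frame + 1)
    )
    return log_sum / num_frames


# Hypothetical per-frame posteriors along one path through the graph.
y = {1: 0.9, 2: 0.8, 3: 0.9}
score = path_score(y, start_frame=1, cur_frame=3)
```

Under this assumption the score is never positive, and exp(score) is the geometric mean of the per-frame posteriors, which makes it directly comparable with a confidence threshold between 0 and 1.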

[0118] At each time step, for every state beam_state in the beam, its sub-states next_state are looked up in the keyword graph, and the path score of each sub-state next_state is calculated and updated into the new beam.

[0119] S530: Select at least some sub-states based on the path score and update them into the beam.

[0120] If multiple sub-states have the same label token, the sub-state with the highest path score is selected to update the beam. The beam_size (beam width) highest-scoring states are then selected from the sub-state sequence.

[0121] S540: Prune the sub-states in the beam based on the pruning threshold and the path score.
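The selection in S530 can be sketched as two passes: keep the best-scoring sub-state per label, then keep the beam_size best of those. The tuple layout `(next_state, token, score)`, the function name `update_beam`, and the sample candidates are assumptions made for this illustration.

```python
def update_beam(candidates, beam_size):
    """Select sub-states for the new beam.

    candidates: list of (next_state, token, score) tuples.
    """
    best_per_token = {}
    for state, token, score in candidates:
        kept = best_per_token.get(token)
        # Among sub-states with the same token, keep the highest score.
        if kept is None or score > kept[2]:
            best_per_token[token] = (state, token, score)
    # Rank the survivors and keep at most beam_size of them.
    ranked = sorted(best_per_token.values(), key=lambda c: c[2], reverse=True)
    return ranked[:beam_size]


# Hypothetical candidates; states 22 and 23 share token 10.
candidates = [(21, 0, -0.1), (22, 10, -9.2), (23, 10, -0.7), (24, 5, -3.0)]
beam = update_beam(candidates, beam_size=2)
```

With these values, state 22 loses to state 23 because both carry token 10, and state 24 is dropped because only two states fit in the beam.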

[0122] Specifically, states in the current beam with a score less than the pruning threshold prune_prob are deleted. The pruning threshold prune_prob can be determined as follows: determine the path score of each current state in the beam recorded at each time step, determine the highest score from all the path scores of the current states, and use the difference between the highest score and the preset threshold prune_threshold as the pruning threshold.
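The pruning rule can be sketched in a few lines. The dictionary layout (state number mapped to path score), the function name `prune_beam`, and the sample scores and threshold are assumptions for illustration; the variable names prune_prob and prune_threshold follow the text.

```python
def prune_beam(beam_scores, prune_threshold):
    """Drop states whose path score is below the pruning threshold.

    beam_scores: dict mapping state number -> path score.
    """
    best = max(beam_scores.values())
    # Pruning threshold: highest score minus the preset threshold.
    prune_prob = best - prune_threshold
    return {s: sc for s, sc in beam_scores.items() if sc >= prune_prob}


kept = prune_beam({21: -0.1, 22: -9.2, 23: -4.0}, prune_threshold=5.0)
```

Because the threshold is relative to the best state at each time step, the number of surviving states adapts to how sharply the posteriors separate the candidates.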

[0123] S550: The initial state is added to the beam, and the sub-states in the beam are used as the new current states. Weighting is then applied again until all keywords are identified. If the states in the beam meet a preset termination condition, the exp value of the score for each state in the beam is determined. If the exp value is greater than a preset confidence threshold, the keyword is identified.

[0124] Since the start time of the keyword is unknown, state 0 is added to the beam, and the sub-state in the beam at this time is used as the current state in the next weighted loop. Then, the process jumps to step S510 and re-executes step S500. For a continuous audio segment, the initial state is added to the beam at each moment to ensure that each moment is the start state of the keyword. During the loop execution, the number of loops is the number of audio frames.

[0125] It is understandable that since the current substate next_state will be updated into the beam in the next time step and become the beam_state in the next time step, the path score of the beam_state in the next time step is the path score of the current substate next_state.

[0126] During the entire pruning process, if a state in the beam corresponds to a preset termination state, the exponential of that state's score is computed and compared with the preset confidence threshold. If exp(beam_state.score) > confidence_threshold, the keyword has been identified, i.e., detected.

[0127] Specifically, for example, an empty beam is initialized at time 0, and state 0 is added to it. Then, at time 1, the sub-states of state 0 include 21 and 22, so sub-states 21 and 22 are added to the beam at time 1, resulting in the current beam containing sub-states 21 and 22. The token corresponding to 0-21 is blank, and the token corresponding to 0-22 is j. The scores of these two tokens in y1 are 0.9 and 0.0001, respectively. From this, the path scores of sub-states 21 and 22 can be calculated (i.e., the process of weighting the graph). Then, during the pruning process, it is determined that sub-state 22 has a low score, so it is pruned. The beam at time 2 will not contain sub-state 22 but will still contain sub-state 21. Then, sub-state 21 is used as the current state, and the sub-states of state 21 are determined until the termination condition is met. During the process, the beam is updated at each time step.
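The first time step of this example can be traced in code. The sketch assumes that path scores are kept as log posteriors and uses a pruning margin of 5.0; both are assumptions, as is the dictionary layout of the graph fragment. The state numbers, tokens, and the posteriors 0.9 and 0.0001 come from the example above.

```python
import math

# Fragment of the decoding graph: state -> [(next_state, token)].
graph = {0: [(21, "<blank>"), (22, "j")]}
# Posteriors of the two tokens in y1, as given in the example.
y1 = {"<blank>": 0.9, "j": 0.0001}
prune_threshold = 5.0  # assumed value, not from the text

beam = {0: 0.0}  # time 0: the beam holds only the initial state
new_beam = {}
for state in beam:
    for next_state, token in graph[state]:
        # The path is one frame long, so its score is a single log posterior.
        new_beam[next_state] = math.log(y1[token])

# Prune relative to the best score at this time step.
best = max(new_beam.values())
beam = {s: sc for s, sc in new_beam.items() if sc >= best - prune_threshold}
```

Sub-state 22 scores about -9.21 against about -0.105 for sub-state 21, so only sub-state 21 remains in the beam for time 2.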

[0128] The speech keyword recognition method of this embodiment can be deployed on a small embedded processor, which typically has little RAM and a low clock speed. The number of parameters in the LSTM network can be configured to be around 95K. If 4-byte floating-point data is used to store these parameters, a total of about (95000*4 / 1024) ≈ 371 KB of storage space is required.

[0129] In order to be able to deploy on these low-resource devices, the parameters can be quantized into integer (int) data, and the LSTM neural network can be quantized layer by layer. The weight matrix of each layer is asymmetrically quantized using the following formula (1), and the bias parameter of each layer in the network is symmetrically quantized using the following formula (2).

[0130] $$x_{\text{int\_1}} = \operatorname{round}(x / s) + z \quad (1)$$

[0131] $$x_{\text{int\_2}} = \operatorname{round}(x / s) \quad (2)$$

[0132] Where $x_{\text{int\_1}}$ is the value after asymmetric quantization, $x_{\text{int\_2}}$ is the value after symmetric quantization, round is the rounding function, x is the value before quantization, s is the scaling factor, and z is the offset (zero point) in asymmetric quantization.

[0133] Symmetric quantization primarily involves using a preset first scaling factor to map the maximum absolute value of a tensor of a first data type to the maximum value of a second data type, and mapping the negative value of the maximum absolute value to the minimum value of the second data type. The precision of the first data type is higher than that of the second data type. For example, setting a scaling factor of scale1 maps the maximum absolute value of a float32 tensor to the maximum value of an int8 tensor, and the negative value of the maximum absolute value to the minimum value of an int8 tensor.

[0134] Asymmetric quantization primarily involves using a preset second scaling factor and a zero point to map the maximum value of a tensor of a first data type to the maximum value of a second data type, and the minimum value of the tensor to the minimum value of the second data type. The precision of the first data type is higher than that of the second data type. For example, setting the scaling factor `scale` and a zero point maps the maximum value of a float32 tensor to the maximum value of int8, and its minimum value to the minimum value of int8.
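Formulas (1) and (2) can be sketched for int8 as follows. How s and z are derived from the tensor is an assumption here (a common min/max calibration), as are the function names and sample values; the symmetric variant uses the range [-127, 127] for simplicity. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding used on a given target device.

```python
def quantize_symmetric(values, qmax=127):
    """Formula (2): x_int = round(x / s), with s taken from max |x|."""
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale


def quantize_asymmetric(values, qmin=-128, qmax=127):
    """Formula (1): x_int = round(x / s) + z, with s and z from min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    # Choose z so that the tensor minimum lands on qmin.
    zero_point = qmin - round(lo / scale)
    return [round(v / scale) + zero_point for v in values], scale, zero_point


weights = [-0.5, 0.0, 0.25, 1.5]  # hypothetical float32 weights
biases = [-1.0, 0.25, 1.0]        # hypothetical float32 biases
q_w, s_w, z_w = quantize_asymmetric(weights)
q_b, s_b = quantize_symmetric(biases)
```

The asymmetric variant uses the full int8 range even when the weights are not centred on zero, while the symmetric variant needs no zero point, which keeps the bias arithmetic simpler.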

[0135] As a result, large-scale networks suitable for desktops can be streamlined and optimized, reducing computational and storage consumption to a level acceptable to embedded processors, enabling them to be deployed on small embedded devices and achieve good retrieval performance.

[0136] The keyword recognition method of this embodiment was tested, and the recognition rate reached 90%, with a false alarm rate of 1 per 5 hours, indicating a high recognition accuracy.

[0137] With the speech keyword recognition method proposed according to the embodiments of this disclosure, new keywords can be recognized even if the keywords are changed, by constructing a corresponding keyword graph and combining it with the original acoustic model. This achieves model reuse, reduces the cost of model construction, improves the efficiency of keyword recognition, and provides high recognition accuracy.

[0138] Figure 7 is a schematic diagram of a speech keyword recognition device implemented in hardware using a processing system, according to one embodiment of the present disclosure. Referring to Figure 7, the speech keyword recognition device 1000 of this embodiment may include a memory 1300 and a processor 1200. The memory 1300 stores execution instructions, and the processor 1200 executes the execution instructions stored in the memory 1300, causing the processor 1200 to execute the speech keyword recognition method of any of the above embodiments.

[0139] The device 1000 may include corresponding modules that perform one or more steps in the flowchart described above. Therefore, each step or several steps in the flowchart can be performed by a corresponding module, and the device may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform a corresponding step, or implemented by a processor configured to perform a corresponding step, or stored in a computer-readable medium for implementation by a processor, or implemented through some combination thereof.

[0140] For example, the speech keyword recognition device 1000 of this embodiment may include an audio feature extraction module 1002, a feature output acquisition module 1004, a posterior probability acquisition module 1006, a directed graph generation module 1008, and a keyword determination module 1010.

[0141] The audio feature extraction module 1002 extracts features from the audio signal to be recognized to obtain audio features. The feature output acquisition module 1004 inputs the audio features into the LSTM module, which maps them to a high-dimensional feature space, to obtain the output. The posterior probability acquisition module 1006 performs a dimension transformation on the output of the LSTM module to obtain the posterior probability matrix. The directed graph generation module 1008 generates a keyword decoding graph, which is a directed cyclic graph containing all optional paths of the keywords. The keyword determination module 1010 weights the optional paths in the directed cyclic graph according to the posterior probability matrix to obtain the path scores of the keywords, thereby identifying the keywords.

[0142] It should be noted that details not disclosed in the speech keyword recognition device 1000 of this embodiment can be found in the details disclosed in the speech keyword recognition method M10 of the above-described embodiment of this disclosure, and will not be repeated here.

[0143] The device 1000 may include corresponding modules that perform one or more steps in the flowchart described above. Therefore, each step or several steps in the flowchart can be performed by a corresponding module, and the device may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform a corresponding step, or implemented by a processor configured to perform a corresponding step, or stored in a computer-readable medium for implementation by a processor, or implemented through some combination thereof.

[0144] This hardware architecture can be implemented using a bus architecture. The bus architecture can include any number of interconnect buses and bridges, depending on the specific application and overall design constraints of the hardware. Bus 1100 connects various circuits, including one or more processors 1200, memory 1300, and / or hardware modules. Bus 1100 can also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, etc.

[0145] Bus 1100 can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, only one connection line is used in the figure, but this does not imply that there is only one bus or only one type of bus.

[0146] With the speech keyword recognition device proposed according to the embodiments of this disclosure, new keywords can be recognized even if the keywords are changed, by constructing a corresponding keyword graph and combining it with the original acoustic model. This achieves model reuse, reduces the cost of model construction, improves the efficiency of keyword recognition, and provides high recognition accuracy.

[0147] Any process or method description in the flowcharts or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process, and the scope of the preferred embodiments of this disclosure includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which embodiments of this disclosure pertain. The processor performs the various methods and processes described above. For example, the method embodiments of this disclosure may be implemented as software programs tangibly contained in a machine-readable medium, such as memory. In some embodiments, part or all of the software program may be loaded and / or installed via memory and / or a communication interface. When the software program is loaded into memory and executed by the processor, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above by any other suitable means (e.g., by means of firmware).

[0148] The logic and / or steps represented in the flowchart or otherwise described herein may be embodied in any readable storage medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them).

[0149] It should be understood that various parts of this disclosure can be implemented in hardware, software, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0150] Those skilled in the art will understand that all or part of the steps of the methods described above can be implemented by a program instructing related hardware, and the program can be stored in a readable storage medium. When executed, the program performs one or a combination of the steps of the method embodiments.

[0151] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into a single processing module, or each unit can exist physically separately, or two or more units can be integrated into a single module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a readable storage medium. The storage medium can be a read-only memory, a disk, or an optical disk, etc.

[0152] In the description of this specification, the references to "one embodiment / mode," "some embodiments / modes," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment / mode or example is included in at least one embodiment / mode or example of this disclosure. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment / mode or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments / modes or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments / modes or examples described in this specification, as well as the features of different embodiments / modes or examples.

[0153] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this disclosure, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0154] Those skilled in the art should understand that the above embodiments are merely for illustrating the present disclosure and are not intended to limit the scope of the disclosure. Those skilled in the art can make other changes or modifications based on the above disclosure, and these changes or modifications still fall within the scope of the present disclosure.

Claims

1. A method for recognizing speech keywords, characterized by comprising:
extracting features from an audio signal to be recognized to obtain audio features;
inputting the audio features into an LSTM module, which maps them to a high-dimensional feature space, to obtain an output;
performing a dimension transformation on the output of the LSTM module to obtain a posterior probability matrix;
generating a keyword decoding graph for keywords, wherein the keyword decoding graph is a directed cyclic graph containing all optional paths of the keywords; and
weighting the optional paths in the directed cyclic graph according to the posterior probability matrix to obtain path scores of the keywords, thereby identifying the keywords;
wherein weighting the optional paths in the directed cyclic graph according to the posterior probability matrix comprises:
for the posterior probability vector of each audio frame, determining from the directed cyclic graph the sub-state corresponding to each current state in the beam, wherein a sub-state is a child node of the state in the directed cyclic graph;
calculating the path score to the current sub-state based on the posterior probability vector;
selecting at least some of the sub-states based on the path score and updating them into the beam;
pruning the sub-states in the beam based on a pruning threshold and the path score; and
adding the initial state to the beam, taking the sub-states in the beam as the new current states, and performing the weighting again until the audio signal to be recognized has been fully processed, wherein if a current state in the beam meets a preset termination condition, the exp value of the path score of that current state is determined, and the keyword is identified when the exp value is greater than a preset confidence threshold.

2. The method according to claim 1, characterized in that inputting the audio features into the LSTM module to obtain the feature output comprises: for each hidden node, obtaining the memory state of the current frame based on the audio features of the current frame; and obtaining the hidden information of the current frame based on the audio features of the current frame and the memory state of the current frame.

3. The method according to claim 2, characterized in that obtaining the memory state of the current frame based on the audio features of the current frame comprises: obtaining the forgotten information, updated information, and candidate information of the current frame based on the audio features of the current frame and the hidden information of the previous frame; obtaining the retained information based on the forgotten information of the current frame and the memory state of the previous frame; and obtaining the memory state of the current frame based on the retained information, the updated information of the current frame, and the candidate information.

4. The method according to claim 2, characterized in that obtaining the hidden information of the current frame based on the audio features of the current frame and the memory state of the current frame comprises: obtaining the initial output of the current frame based on the audio features of the current frame and the hidden information of the previous frame; and obtaining the hidden information of the current frame based on the initial output of the current frame and the memory state of the current frame.

5. The method according to claim 1, characterized in that the path score of the sub-state is calculated using the following formula: where cur_frame is the time frame at which the current state is reached, start_frame is the time frame at which the keyword begins, and $y_t$ is the posterior probability of the label corresponding to the state at time t.

6. The method according to claim 1, characterized in that selecting at least some of the sub-states based on the path score and updating them into the beam comprises: if multiple sub-states have the same label, selecting the sub-state with the highest path score to update the beam.

7. The method according to claim 1, characterized in that the pruning threshold is determined as follows: determining the path score of each current state in the beam at each time step, determining the highest score among all the path scores of the current states, and using the difference between the highest score and a preset threshold as the pruning threshold.

8. The method according to claim 1, characterized in that the path score of the current state at the next time step is the path score of the current sub-state.

9. A speech keyword recognition device, characterized by comprising: a memory that stores execution instructions; and a processor that executes the execution instructions stored in the memory, causing the processor to perform the method according to any one of claims 1 to 8.