Speech recognition method and device, computer equipment and storage medium
A speech recognition and phoneme recognition technology, applied in speech recognition, speech analysis, instruments, etc., which can solve the problems that the empty output increases deletion errors and the error rate in the decoding process and thereby affects speech recognition accuracy, so as to achieve the effects of improving recognition accuracy and reducing deletion errors.
Pending Publication Date: 2021-10-22
TENCENT TECH (SHENZHEN) CO LTD
AI-Extracted Technical Summary
Problems solved by technology
[0004] However, the RNN-T model introduces the concept of the empty output into the phoneme recognition process, that is, it may predict that a speech frame contains no valid phoneme. The introduction of the empty outp...
Method used
Through the scheme shown in this application, on the one hand, the Transducer-based end-to-end model does not need frame-level alignment information during training, which greatly simplifies the modeling process; on the other hand, the decoding graph is simplified, reducing the search space: because phoneme modeling is used, only the lexicon (L) and the language model (G) are needed to build the decoding graph, so the search space is greatly reduced. Finally, phoneme modeling combined with a custom decoding graph can meet flexible customization requirements: for different business scenarios, only the language model needs to be customized to adapt to the respective business scene, without changing the acoustic model.
[0126] In the above Transducer model, in order to describe the model's historical information, the Encoder and Predictor networks generally adopt a recurrent neural network (RNN) structure, such as an LSTM or a gated recurrent unit (GRU). However, on embedded devices with limited computing resources, a recurrent neural network entails a large amount of calculation and occupies a large amount of CPU resources. On the other hand, the content of vehicle-mounted offline speech recognition consists mainly of query and control instructions, and the sentences are relatively short, without overly long historical information. In this regard, this scheme uses an FSMN-based Encoder and a one-dimensional-convolution-based Predictor network. On the one hand, model parameters can be compressed; on the other hand, computing resources can be greatly saved, computing speed can be improved, and real-time performance of speech recognition can be...
Abstract
The invention relates to a speech recognition method and device, computer equipment, and a storage medium, and relates to the technical field of speech recognition. The method comprises the following steps: processing a speech signal through an acoustic model to obtain a phoneme recognition result corresponding to each speech frame in the speech signal; suppressing and adjusting the probability of the empty output in the phoneme recognition result corresponding to each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result; and inputting the adjusted phoneme recognition result corresponding to each speech frame into a decoding graph to obtain a recognized text sequence corresponding to the speech signal. According to the scheme, the recognition accuracy of the model can be improved in speech recognition scenarios in the field of artificial intelligence.
Examples
Experimental program (1)
Example Embodiment
[0056] Exemplary embodiments are described in detail here, with examples illustrated in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present application; rather, they are merely examples of apparatus and methods consistent with some aspects of the present application, as detailed in the appended claims.
[0057] Before describing the various embodiments of the present application, a few concepts involved in the present application are introduced:
[0058] 1) Artificial Intelligence (AI)
[0059] AI is a theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision-making.
[0060] Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning / deep learning.
[0061] 2) Speech Technology (ST)
[0062] The key technologies of speech technology include Automatic Speech Recognition (ASR), Text-To-Speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is regarded as one of the most promising modes of such interaction.
[0063] 3) Machine Learning (ML)
[0064] Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
[0065] The scheme provided by the embodiments of the present application applies to scenarios involving artificial intelligence technologies such as speech technology and machine learning, to recognize user speech as corresponding text.
[0066] Please refer to Figure 1, which shows a system configuration diagram of a speech recognition system according to various embodiments of the present application. As shown in Figure 1, the system includes a sound acquisition assembly 120 and a speech recognition device 140.
[0067] The sound acquisition assembly 120 and the speech recognition device 140 are connected in a wired or wireless manner.
[0068] The sound acquisition assembly 120 can be implemented as a microphone, a microphone array or a pickup. The sound acquisition component 120 is used to collect voice data when the user speaks.
[0069] The speech recognition device 140 is configured to recognize the voice data acquired by the sound acquisition assembly 120 to obtain a recognized text sequence.
[0070] Optionally, the speech recognition device 140 can also perform natural language processing on the recognized text sequence to respond to the user's speech.
[0071] The sound acquisition assembly 120 and the speech recognition device 140 can be implemented as two independent hardware devices. For example, the sound acquisition assembly 120 is a microphone provided on a vehicle steering wheel, and the speech recognition device 140 can be an in-vehicle smart device; or the sound acquisition assembly 120 is a microphone disposed on a remote control, and the speech recognition device 140 can be a smart home device controlled by the remote control (such as a smart TV, set-top box, or air conditioner).
[0072] Alternatively, the sound acquisition assembly 120 and the speech recognition device 140 can be implemented as the same hardware device. For example, the speech recognition device 140 can be a mobile terminal such as a smartphone, a tablet, a smart watch, or smart glasses, and the sound acquisition assembly 120 can be a microphone built into the speech recognition device 140.
[0073] In a possible implementation, the speech recognition system may further include a server 160.
[0074] The server 160 can be used to deploy and update the speech recognition model in the speech recognition device 140. Alternatively, the server 160 can provide a speech recognition service to the speech recognition device 140, i.e., recognize the voice data transmitted by the speech recognition device 140 and return the recognition result to the speech recognition device 140. Alternatively, the server 160 can also cooperate with the speech recognition device 140 to complete the recognition of the voice data and the response to the voice data.
[0075] The server 160 is one server, a cluster composed of several servers, a virtualization platform, or a cloud computing service center.
[0076] The server can be a stand-alone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
[0077] The server 160 is connected to the voice recognition device 140 through a communication network. Optionally, the communication network is a wired network or a wireless network.
[0078] Optionally, the system can also include a management device (not shown in Figure 1) connected to the server 160 through a communication network. Optionally, the communication network is a wired or wireless network.
[0079] Optionally, the above wireless or wired network uses standard communication technologies and/or protocols. The network is usually the Internet, but can also be any other network, including but not limited to any combination of a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired, or wireless network, a dedicated network, or a virtual private network. In some embodiments, the data exchanged over the network is represented using techniques and/or formats including HyperText Markup Language (HTML) and Extensible Markup Language (XML). In addition, conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) can be used to encrypt all or some of the links. In other embodiments, custom and/or dedicated data communication techniques can also be used in place of, or in addition to, the above data communication techniques.
[0080] Please refer to Figure 2, which is a flowchart of a speech recognition method according to an exemplary embodiment. The method can be performed by a computer device; for example, the computer device can be the speech recognition apparatus 140 or the server 160 in the system shown in Figure 1, or the computer device may include both the speech recognition device 140 and the server 160 in that system. As shown in Figure 2, the speech recognition method can include the following steps:
[0081] Step 21: Process the speech signal through the acoustic model to obtain a phoneme recognition result corresponding to each speech frame in the speech signal; the phoneme recognition result indicates the probability distribution of the corresponding speech frame over the phoneme space; the phoneme space contains each phoneme and an empty output; the acoustic model is trained on speech signal samples and the actual phonemes of each speech frame in the speech signal samples.
[0082] A phoneme is the minimum speech unit divided according to the natural attributes of speech. It is analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. Phonemes are divided into two categories: vowels and consonants. For example, the Chinese syllable ā has only one phoneme, ài (love) has two phonemes, and dài (generation) has three phonemes.
[0083] A phoneme is the smallest unit constituting a syllable, or the smallest speech segment, i.e., the smallest linear speech unit divided from the perspective of sound quality. A phoneme is a concrete physical phenomenon. The International Phonetic Alphabet (developed by the International Phonetic Association to uniformly transcribe the speech sounds of all languages) is also known as the "International Phonetic Symbols" or "Universal Phonetic Symbols".
[0084] In the embodiments of the present application, for each speech frame in the speech signal, the acoustic model can perform phoneme recognition on the speech frame to obtain the probability that the speech frame belongs to each preset phoneme or to the empty output.
[0085] For example, in a possible implementation, the above phoneme space contains 212 phonemes and one empty output (indicating that the corresponding speech frame contains no user pronunciation); that is, for an input speech frame, the acoustic model of the embodiments of the present application can output the probabilities of the 212 phonemes and the empty output, respectively.
[0086] Step 22: Suppress and adjust the probability of the empty output in the phoneme recognition result of each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result.
[0087] Step 23: Input the adjusted phoneme recognition result corresponding to each speech frame into the decoding graph to obtain the recognized text sequence corresponding to the speech signal.
[0088] In the embodiments of the present application, after the phoneme recognition result is input into the decoding graph, the decoding graph determines, based on the probability of each element of the phoneme space in the phoneme recognition result, whether the result corresponds to a certain phoneme or to the empty output, and determines the corresponding text according to the determined phoneme; if the result is the empty output, it is determined that the speech frame corresponding to the phoneme recognition result contains no user pronunciation, i.e., it corresponds to no text.
[0089] Since the empty output is included in the above phoneme recognition result, the recognition error rate may rise; for example, a speech frame containing pronunciation may be misrecognized as an empty output (this is also referred to as a deletion error), which affects the accuracy of speech recognition. Therefore, in the scheme shown in the present application, after the acoustic model outputs the phoneme recognition result, the probability of the empty output in the phoneme recognition result is suppressed; as the probability of the empty output is suppressed, the probability that the recognition result is identified as one of the phonemes rises relatively, which can effectively reduce the likelihood that speech frames containing pronunciation are erroneously identified as the empty output.
[0090] In summary, in the scheme shown in the present application, for a phoneme recognition result that indicates the probability distribution of a speech frame over each phoneme and the empty output, before the phoneme recognition result is input into the decoding graph, the probability of the empty output in the phoneme recognition result is first suppressed, thereby reducing the likelihood that the speech frame is identified as an empty output, that is, reducing the deletion errors of the model, and thereby increasing the recognition accuracy of the model.
[0091] Please refer to Figure 3, which is a flowchart of a speech recognition method according to an exemplary embodiment. The method can be performed by a computer device; for example, the computer device can be the speech recognition apparatus 140 or the server 160 in the system shown in Figure 1, or the computer device may include both the speech recognition device 140 and the server 160 in that system. As shown in Figure 3, the speech recognition method can include the following steps:
[0092] Step 301: Acquire a speech signal, the speech signal comprising speech frames obtained by dividing the original voice.
[0093] In the embodiments of the present application, the original voice is the user's speech acquired by the sound acquisition component and sent to the computer device, for example, to the speech recognition device; the speech recognition device segments the original voice to obtain a number of speech frames.
[0094] In a possible implementation, the speech recognition device can divide the original voice into short, fixed-length speech clips. For example, for a sampling rate of 16 kHz, a single frame length of 25 ms and an inter-frame overlap of 15 ms can be used; this process is also known as "framing".
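The framing step described above can be sketched in a few lines; a minimal pure-Python illustration using the exact parameters quoted in the text (16 kHz sampling, 25 ms frames, 15 ms overlap, i.e. a 10 ms shift). The function name `frame_signal` is illustrative, not from the original.

```python
SAMPLE_RATE = 16000
FRAME_LEN = int(0.025 * SAMPLE_RATE)    # 400 samples per 25 ms frame
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)  # 160 samples per 10 ms shift (15 ms overlap)

def frame_signal(samples):
    """Return a list of overlapping frames (each a list of samples)."""
    frames = []
    start = 0
    while start + FRAME_LEN <= len(samples):
        frames.append(samples[start:start + FRAME_LEN])
        start += FRAME_SHIFT
    return frames

# One second of audio yields floor((16000 - 400) / 160) + 1 = 98 frames.
frames = frame_signal([0.0] * SAMPLE_RATE)
```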
[0095] Step 302: Process the speech signal through the acoustic model to obtain a phoneme recognition result corresponding to each speech frame in the speech signal.
[0096] Wherein, the phoneme recognition result indicates the probability distribution of the corresponding speech frame over the phoneme space; the phoneme space contains each phoneme and an empty output; the acoustic model is trained on speech signal samples and the actual phonemes of each speech frame in the speech signal samples.
[0097] In the embodiments of the present application, the acoustic model is an end-to-end machine learning model whose input data includes a speech frame in the speech signal (e.g., a feature vector of the speech frame) and whose output data is the predicted probability distribution of the speech frame over the phoneme space, i.e., the phoneme recognition result.
[0098] For example, the above phoneme recognition result can be represented as a probability vector:
[0099] (p_0, p_1, p_2, ..., p_212)
[0100] In the above probability vector, p_0 indicates the probability that the speech frame corresponds to the empty output, and p_1 indicates the probability that the speech frame corresponds to the first phoneme; the entire phoneme space contains 212 phonemes plus one empty output.
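As an illustration of such a probability vector, the sketch below builds a 213-dimensional distribution (index 0 for the empty output, indices 1 to 212 for the phonemes) from toy scores with a softmax, which is how the joint network described later in the text produces its distribution. All numbers here are placeholders.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution (sums to 1)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 213-dimensional output: index 0 is the empty output, indices
# 1..212 are the phonemes, matching the phoneme space in the text.
scores = [0.0] * 213
scores[0] = 2.0  # the model leans towards the empty output here
probs = softmax(scores)
```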
[0101] In a possible implementation, processing the speech signal through the acoustic model to obtain the phoneme recognition result corresponding to each speech frame in the speech signal includes:
[0102] performing feature extraction on a target speech frame to obtain a feature vector of the target speech frame, the target speech frame being any one of the speech frames;
[0103] inputting the feature vector of the target speech frame into the encoder in the acoustic model to obtain an acoustic hidden-layer representation vector of the target speech frame;
[0104] inputting the historical recognized text of the target speech frame into the predictor in the acoustic model to obtain a text hidden-layer representation vector of the target speech frame; the historical recognized text of the target speech frame is the recognition result, by the decoding graph, of the first N non-empty-output speech frames preceding the target speech frame; N is an integer greater than or equal to 1;
[0105] inputting the acoustic hidden-layer representation vector of the target speech frame and the text hidden-layer representation vector of the target speech frame into the joint network to obtain the phoneme recognition result of the target speech frame.
[0106] In the embodiments of the present application, the above acoustic model can be realized by a Transducer model. The Transducer model is described below:
[0107] Given an input sequence:
[0108] x = (x_1, x_2, ..., x_T)
[0109] and an output sequence:
[0110] y = (y_1, y_2, ..., y_U)
[0111] where x belongs to X*, the set of all input sequences, and y belongs to Y*, the set of all output sequences, with X and Y denoting the input and output spaces, respectively. For example, in this scheme, the Transducer model is used to perform phoneme recognition: the input sequence x is a feature vector sequence, such as filter bank (FBank) features or Mel-frequency cepstral coefficients (MFCC), where x_t denotes the feature vector at time t; the output sequence y is a phoneme sequence, where y_u denotes the u-th phoneme.
[0112] Define an extended output space Ȳ = Y ∪ {∅}, where ∅ denotes the empty output symbol, representing that the model produces no output. After the empty output symbol is introduced, a sequence a over the extended space is equivalent to the output sequence y obtained by removing its empty outputs. In this scheme, because of the introduction of the empty output, a sequence with empty outputs inserted can be aligned with the input sequence; the elements of the set of such sequences are therefore called "alignments". Given an arbitrary input sequence, the Transducer model defines a conditional distribution P(a | x); this conditional distribution is used to calculate the probability of outputting the sequence y given the input sequence x:
[0113] P(y | x) = Σ_{a ∈ B⁻¹(y)} P(a | x)    (1)
[0114] where B denotes the mapping that removes the empty outputs from an alignment sequence, and B⁻¹(y) denotes the set of all alignment sequences generated by adding empty outputs to the output sequence y. As can be seen from formula (1), in order to calculate the probability of the output sequence y, it is necessary to sum the conditional probabilities of all possible alignments a corresponding to the sequence y. Please refer to Figure 4, which shows a schematic diagram of an alignment process according to an embodiment of the present application. Figure 4 gives an example to illustrate formula (1).
[0115] In Figure 4, U = 3 and T = 5; every path from the lower-left corner to the upper-right corner is an alignment. The bold arrows mark one of the possible paths. When the model moves upward, it outputs a non-empty symbol (a phoneme); when the model moves rightward, it outputs the empty symbol (i.e., the empty output), indicating that no output is generated for that frame. At the same time, the model allows multiple outputs at the same time step.
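In the lattice picture above, every alignment is a sequence of T rightward (empty output) moves and U upward (phoneme) moves, so the number of alignments that formula (1) sums over can be counted directly. A small sketch, where `num_alignments` is a hypothetical helper name:

```python
from math import comb

def num_alignments(T, U):
    """Count monotonic paths from the lower-left to the upper-right
    corner of the alignment lattice: choose where the U upward
    (phoneme) moves fall among the T + U total steps."""
    return comb(T + U, U)

paths = num_alignments(5, 3)  # the Figure 4 example: U = 3, T = 5 -> 56 alignments
```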
[0116] To model P(a | x), three sub-networks are generally used for joint modeling. Please refer to Figure 5, which shows a schematic structural diagram of an acoustic model according to an embodiment of the present application. As shown in Figure 5, the acoustic model includes an encoder 51, a predictor 52, and a joint network 53.
[0117] The encoder 51 (Encoder) can be a recurrent neural network, such as a long short-term memory (LSTM) network; it receives the audio feature input at time t and outputs an acoustic hidden-layer representation vector.
[0118] The predictor 52 (Predictor) can be a recurrent neural network, such as an LSTM; it receives the model's historical non-empty output labels and outputs a text hidden-layer representation vector.
[0119] The joint network 53 (Joint Network) can be a fully connected neural network, such as a linear layer plus an activation unit; it linearly transforms and combines the two hidden-layer representation vectors to output a hidden unit representation z_i, which is finally converted into a probability distribution by a softmax function.
[0120] In Figure 5 above, with the acoustic representation produced by the encoder and the text representation produced by the predictor fed into the joint network, the calculation of formula (1) becomes:
[0121] P(y | x) = Σ_{a ∈ B⁻¹(y)} Π_i P(a_i | x_{1:t_i}, y_{1:u_i})    (2)
[0122] Calculating formula (2) directly requires traversing all alignments, which causes a large amount of calculation. During model training, the probability in formula (2) can instead be computed efficiently with a forward-backward (dynamic programming) algorithm.
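The dynamic-programming idea mentioned above (summing over all alignments without enumerating them) can be sketched with a simple forward recursion over the lattice. The probability tables below are toy placeholders rather than real model outputs, and the interface is illustrative only, assuming position-dependent blank/label probabilities:

```python
def transducer_forward(T, U, p_blank, p_label):
    """alpha[t][u]: total probability of all partial alignments that
    have consumed t frames and emitted u labels. p_blank[t][u] and
    p_label[t][u] are the empty-output / next-label probabilities at
    lattice node (t, u)."""
    alpha = [[0.0] * (U + 1) for _ in range(T + 1)]
    alpha[0][0] = 1.0
    for t in range(T + 1):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            total = 0.0
            if t > 0:  # arrive by consuming a frame via an empty output
                total += alpha[t - 1][u] * p_blank[t - 1][u]
            if u > 0:  # arrive by emitting the next target label
                total += alpha[t][u - 1] * p_label[t][u - 1]
            alpha[t][u] = total
    return alpha[T][U]

# Toy check: T = 2, U = 1, every probability 0.5. The three possible
# alignments each have probability 0.5 ** 3, so the sum is 0.375.
half = [[0.5] * 2 for _ in range(3)]
total = transducer_forward(2, 1, half, half)
```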
[0123] In a possible implementation, the encoder is a feedforward sequential memory network (FSMN).
[0124] In a possible implementation, the predictor is a one-dimensional convolution network.
[0125] The scheme shown in the embodiments of the present application can be applied in scenarios such as an in-vehicle offline speech recognition system. Vehicle-mounted equipment places high demands on the model parameter amount and computation amount, and its central processor is limited, so the requirements on the model parameter amount and model structure are high. In order to reduce the computation amount and adapt to such computing-power-limited application scenarios, the scheme shown in this application uses the fully feedforward neural network FSMN as the Encoder and uses a one-dimensional convolution network to replace the commonly used long short-term memory network (LSTM) as the Predictor.
[0126] In the above Transducer model, in order to capture the model's historical information, the Encoder and Predictor networks generally use a recurrent neural network (RNN) structure, such as an LSTM or a gated recurrent unit (GRU). However, on embedded devices with limited resources, recurrent neural networks bring a large amount of computation and occupy a large amount of CPU resources. On the other hand, the content of in-vehicle offline speech recognition consists mainly of query and control instructions, and the sentences are relatively short, so overly long historical information is unnecessary. For this reason, this scheme uses an FSMN-based Encoder and a one-dimensional-convolution-based Predictor network. On the one hand, the model parameters can be compressed; on the other hand, computational resources can be greatly saved, the computation speed can be improved, and the real-time performance of speech recognition can be ensured.
[0127] In this scheme, an FSMN-based Encoder structure is adopted. The FSMN network has been applied to large-vocabulary speech recognition tasks. The FSMN structure employed in this scheme can be a structure with a projection layer and residual connections.
[0128] For the Predictor network, a one-dimensional convolution network is used in this scheme, and the current output is generated based on a limited history of predicted outputs. Please refer to Figure 6, which shows a network structure diagram of a predictor according to an embodiment of the present application. As shown in Figure 6, the Predictor network uses four non-empty historical outputs to predict the current output frame. That is, the four non-empty historical outputs 61 preceding the current input are passed through a vector (embedding) mapping and then input into the one-dimensional convolution network 62 to obtain the text hidden-layer representation vector.
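A rough, hypothetical sketch of such a predictor: the last four non-empty labels are embedded and passed through a single one-dimensional convolution window. All weights are random placeholders, and the dimensions are invented for illustration; the real network's layer sizes are not specified in this excerpt.

```python
import random

random.seed(0)
EMB_DIM, HIDDEN, CONTEXT = 8, 16, 4  # CONTEXT = 4 non-empty history outputs

# Placeholder embedding table for the 212 phonemes plus the blank (213 ids).
embed = {p: [random.gauss(0, 1) for _ in range(EMB_DIM)] for p in range(213)}
# One convolution window spanning the whole 4-token context -> HIDDEN dims.
kernel = [[random.gauss(0, 1) for _ in range(CONTEXT * EMB_DIM)]
          for _ in range(HIDDEN)]

def predictor(history):
    """Map the last CONTEXT non-empty labels to a text hidden-layer
    representation vector; short histories are padded with a preset
    id (0 here), as the training description below also mentions."""
    padded = ([0] * CONTEXT + history)[-CONTEXT:]
    window = [v for tok in padded for v in embed[tok]]
    return [sum(w * x for w, x in zip(row, window)) for row in kernel]

h = predictor([17, 42])  # padded to [0, 0, 17, 42]
```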
[0129] In the embodiments of the present application, the acoustic model can be trained on preset speech samples and the actual phonemes of each speech frame in the speech signal samples. For example, in the training process, an audio frame in a speech sample is input into the FSMN-based Encoder network of the acoustic model, and the actual phonemes of the first four non-empty speech frames preceding that speech frame (when there are no historical non-empty speech frames, or not enough of them, they can be replaced by preset phonemes) are input into the one-dimensional-convolution-based Predictor network. Based on the results of the acoustic model processing the input data, the parameters of the three parts of the acoustic model (the Encoder, the Predictor, and the joint network) are updated so that the sum of the probabilities over all possible alignment paths, i.e., the result of formula (2) above, is maximized, thereby training the acoustic model.
[0130] Step 303: Suppress and adjust the probability of the empty output in the phoneme recognition result of each speech frame, so as to reduce the ratio of the probability of the empty output to the probability of each phoneme in the phoneme recognition result.
[0131] In a possible implementation, suppressing and adjusting the probability of the empty output in the phoneme recognition result of each speech frame includes:
[0132] adjusting the phoneme recognition result corresponding to each speech frame by at least one of the following adjustment methods:
[0133] decreasing the probability of the empty output in the phoneme recognition result of each speech frame;
[0134] and increasing the probability of each phoneme in the phoneme recognition result of each speech frame.
[0135] In a possible implementation, decreasing the probability of the empty output in the phoneme recognition result corresponding to each speech frame includes:
[0136] multiplying the probability of the empty output in the phoneme recognition result corresponding to each speech frame by a first weight, the first weight being less than 1 and greater than 0.
[0137] In the embodiments of the present application, when suppressing the probability of the empty output in the phoneme recognition result, only the probability of the empty output may be reduced, for example, by multiplying the probability of the empty output in the phoneme recognition result by a number between 0 and 1. With the probabilities of the phonemes in the phoneme recognition result unchanged, the ratio between the probability of the empty output and the probability of each phoneme is thus reduced.
[0138] In a possible implementation, increasing the probability of each phoneme in the phoneme recognition result corresponding to each speech frame includes:
[0139] multiplying the probability of each phoneme in the phoneme recognition result corresponding to each speech frame by a second weight, the second weight being greater than 1.
[0140] In the embodiments of the present application, when suppressing the probability of the empty output in the phoneme recognition result, only the probabilities of the phonemes may be increased, for example, by multiplying the probability of each phoneme in the phoneme recognition result by a number greater than 1. With the probability of the empty output in the phoneme recognition result unchanged, the ratio between the probability of the empty output and the probability of each phoneme can likewise be reduced.
[0141] In another exemplary scheme, the computer device can also increase the probability of each phoneme in the phoneme recognition result while reducing the probability of the empty output, for example, by multiplying the probability of the empty output in the phoneme recognition result by a number between 0 and 1 and, at the same time, multiplying the probability of each phoneme by a number greater than 1.
[0142] In this scheme, in order to obtain an alignment between the input and output, the acoustic model needs to insert the empty output symbol ∅ into the output phoneme sequence; that is, the ∅ symbol is predicted by the model together with the other phonemes. Assuming that the total number of non-empty phonemes is P, the output dimension of the final model is P + 1, where dimension 0 usually represents the empty output ∅. Experiments found that the introduction of the empty output raises the model's deletion errors, which means that a large number of phonemes are erroneously identified as empty outputs. In order to solve the problem of an excessively large empty output probability, this application reduces deletion errors by adjusting the weight of the empty output probability during Transducer decoding.
[0143] Taking multiplying the probability of the empty output in the phoneme recognition result of each speech frame by the first weight as an example, suppose the probability of the empty output is P_blank. To reduce it, this scheme divides the original empty-output probability by a weight α, where α > 1, and the adjusted empty-output probability is:
[0144] P'_blank = P_blank / α (3)
[0145] In general, the logarithmic probability is used as the final value in the decoding score calculation, so taking the logarithm of equation (3) gives:
[0146] log P'_blank = log P_blank − log α (4)
[0147] The result of equation (4) can be used as the adjusted empty-output probability in subsequent decoding.
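The adjustment in equations (3) and (4) can be sketched in a few lines of Python. The weight α = 2.0 and the blank index 0 below are illustrative assumptions, not values fixed by this scheme:

```python
import math

def suppress_blank(log_probs, blank_id=0, alpha=2.0):
    """Divide the blank probability by a weight alpha (alpha > 1),
    which in the log domain means subtracting log(alpha), as in
    equation (4). Other phoneme log-probabilities are unchanged."""
    adjusted = list(log_probs)
    adjusted[blank_id] -= math.log(alpha)
    return adjusted

# Example: a frame where the blank has probability 0.7.
frame = [math.log(p) for p in (0.7, 0.2, 0.1)]
out = suppress_blank(frame)
```

With α = 2, the blank probability of 0.7 is halved to 0.35 while the phoneme probabilities stay put, shrinking the blank-to-phoneme ratio exactly as paragraph [0143] describes.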
[0148] In a possible implementation, the first weight or the second weight is preset in the computer device; for example, the first weight or the second weight can be preset by a developer in the speech recognition model.
[0149] In step 304, among the phoneme recognition results corresponding to each speech frame, those whose empty-output probability satisfies a specified condition are input into the decoding graph to obtain the recognized text sequence corresponding to the speech signal.
[0150] In a possible implementation, inputting the adjusted phoneme recognition results corresponding to each speech frame into the decoding graph to obtain the recognized text sequence corresponding to the speech signal includes:
[0151] in response to the probability of the empty output in a target phoneme recognition result satisfying the specified condition, inputting the target phoneme recognition result into the decoding graph to obtain the recognized text corresponding to the target phoneme recognition result;
[0152] wherein the target phoneme recognition result is any one of the phoneme recognition results corresponding to each speech frame.
[0153] In a possible implementation, the specified condition includes:
[0154] the probability of the empty output in the target phoneme recognition result being less than a probability threshold.
[0155] Experiments found that the output of the Transducer model has a more pronounced spike effect than the DNN-HMM model; that is, at certain moments, the model outputs a prediction with very high confidence. Using this spike effect, the frames for which the model predicts the empty output with high probability can be skipped during decoding; that is, those probability distributions do not participate in the decoding-graph search. Because this scheme uses phonemes as the modeling unit and skips blank frames during decoding, the number of search steps in the decoding graph is related to the number of phonemes; this is called phone-synchronous decoding (PSD). The following algorithm gives the entire process of the PSD and empty-output weight adjustment proposed in this scheme:
[0156] Algorithm 1: the PSD algorithm (the listing itself appears as an image in the original publication and is not reproduced here).
[0159] In the algorithm, the sixth line performs the weight adjustment of the empty-output probability (the blank weight β in the algorithm). The remaining lines implement the PSD algorithm proposed in this scheme: only when the probability of the empty output is less than a certain threshold γ does the probability distribution output by the network participate in the subsequent decoding-graph search.
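The PSD loop described above can be sketched as follows, reusing the log-probability representation from equation (4); the values α = 2.0 and γ = 0.4 are hypothetical tuning choices, not values prescribed by this scheme:

```python
import math

def psd_decode(frames, blank_id=0, alpha=2.0, gamma=0.4):
    """Phone-synchronous decoding sketch: suppress each frame's blank
    log-probability by log(alpha); only frames whose adjusted blank
    probability falls below the threshold gamma are passed on to the
    decoding-graph search, the rest are treated as blank and skipped."""
    kept = []
    for t, log_probs in enumerate(frames):
        adjusted = list(log_probs)
        adjusted[blank_id] -= math.log(alpha)
        if math.exp(adjusted[blank_id]) < gamma:
            kept.append((t, adjusted))  # participates in graph search
        # else: spike on blank -> frame skipped entirely
    return kept
```

A frame whose blank probability stays at or above γ even after suppression never reaches the decoding graph, which is what ties the number of search steps to the number of phonemes rather than the number of frames.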
[0160] In a possible implementation, the probability threshold is preset in the computer device by a developer or administrator; for example, the probability threshold can be preset by the developer in the speech recognition model.
[0161] In a possible implementation, before inputting the adjusted phoneme recognition results corresponding to each speech frame into the decoding graph to obtain the recognized text sequence corresponding to the voice signal, the method further includes:
[0162] obtaining threshold influence parameters, the threshold influence parameters including at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information;
[0163] determining the probability threshold based on the threshold influence parameters.
[0164] In this application embodiment, the probability threshold can also be adjusted by the computer device during the speech recognition process. That is, the computer device can acquire the parameters that may affect the value of the probability threshold and flexibly set the probability threshold according to those parameters.
[0165] For example, ambient sound can interfere with recognition of the user's voice, so when the ambient sound intensity is strong, the computer device can raise the probability threshold so that more phoneme recognition results are input into the decoding graph, thereby ensuring recognition accuracy. Conversely, when the ambient sound intensity is weak, the computer device can lower the probability threshold so that more phoneme recognition results are skipped, thereby ensuring recognition efficiency.
[0166] For another example, the accuracy of decoding affects the success rate of speech recognition. When the number of speech recognition failures within a specified time period (such as a period before the current time, for example, 5 minutes) is too large, the computer device can raise the probability threshold so that more phoneme recognition results are input into the decoding graph, thereby ensuring recognition accuracy; conversely, when the number of speech recognition failures within the specified time period is small, the computer device can lower the probability threshold so that more phoneme recognition results are skipped, thereby ensuring recognition efficiency.
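The threshold adaptation of paragraphs [0164] to [0166] could look like the following sketch; the base value, the 60 dB noise cutoff, and the failure-count cutoff of 3 are invented for illustration and are not given by this scheme:

```python
def adapt_threshold(base=0.6, ambient_db=None, recent_failures=0,
                    user_offset=0.0):
    """Raise the blank-probability threshold when the environment is
    noisy or recent recognitions failed, so that more phoneme results
    reach the decoding graph; otherwise keep it low so that more
    blank-dominated frames are skipped."""
    thr = base
    if ambient_db is not None and ambient_db > 60:  # assumed noise cutoff
        thr += 0.1
    if recent_failures > 3:                         # assumed failure cutoff
        thr += 0.1
    thr += user_offset                              # user setting information
    return min(max(thr, 0.1), 0.95)                 # clamp to a sane range
```

Since a frame is decoded only when its adjusted blank probability is below the threshold, raising the threshold admits more frames into the search (favoring accuracy) and lowering it skips more frames (favoring efficiency), matching the two examples above.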
[0167] In a possible implementation, the decoding graph is obtained by composing a phoneme dictionary and a language model.
[0168] The decoding graph used in this scheme is composed of two sub-graphs, each a weighted finite state transducer (Weighted Finite State Transducer, WFST):
[0169] Phoneme dictionary WFST: the mapping from phoneme sequences to Chinese characters or words. Given an input phoneme sequence, this WFST outputs the corresponding Chinese characters or words. Usually this WFST is independent of the text domain and is a common component across different recognition tasks;
[0170] Language model WFST: this WFST is typically converted from an N-gram language model, which is built from training data with statistical methods and is used to compute the probability of a sentence. Text from different domains, such as news and spoken conversation, differs greatly, so when performing speech recognition in different domains, replacing the language model WFST achieves domain adaptation.
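As a toy illustration of how the two WFSTs cooperate (the actual scheme composes full WFST graphs; the lexicon entry and bigram probability below are invented examples, not the patent's data):

```python
import math

# Toy stand-ins for the phoneme dictionary WFST and the N-gram
# language model WFST (invented example entries).
LEXICON = {("n", "i", "h", "ao"): "你好"}       # phoneme sequence -> word
BIGRAM_LOGP = {("<s>", "你好"): math.log(0.5)}  # log P(word | history)

def decode_word(phonemes, history="<s>"):
    """Map a phoneme sequence to a word via the lexicon, then score it
    with the bigram language model; returns (word, log-prob), or None
    if the lexicon has no entry for the sequence."""
    word = LEXICON.get(tuple(phonemes))
    if word is None:
        return None
    return word, BIGRAM_LOGP.get((history, word), math.log(1e-6))
```

Swapping in a different `BIGRAM_LOGP` while keeping `LEXICON` fixed mirrors the domain adaptation described above: the lexicon WFST is shared across tasks and only the language model WFST is replaced.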
[0171] Please refer to Figure 7, which shows the model training and application flow according to an embodiment of the present application. As shown in Figure 7, taking a vehicle-mounted device as an example, after training, the model is quantized and deployed via Libtorch. The Android version of Libtorch uses the QNNPACK library for int8 matrix computation, which greatly accelerates matrix operations. The model is trained in the Python environment 71 and then quantized; that is, the model parameters are quantized to int8 so that int8 matrix multiplication can accelerate the computation. The quantized model is then exported and used for forward inference in the C++ environment 72, where it is tested with test data.
[0172] Through the scheme shown in this application, on the one hand, the Transducer-based end-to-end model needs no frame-level alignment information during training, which greatly simplifies the modeling process. Secondly, the decoding graph is simplified and the search space reduced: because phoneme modeling is used, the decoding graph only needs L (the lexicon) and G (the language model), so the search space is greatly reduced. Finally, phoneme modeling combined with a custom decoding graph enables flexible customization: for different business scenarios, only the language model needs to be customized, without changing the acoustic model, to adapt to the respective business scenes.
[0173] Compared with offline recognition systems in the related art, this scheme has advantages in both recognition rate and CPU usage:
[0174] In terms of recognition rate, the system shown in this scheme improves significantly compared with the DNN hidden Markov model (Hidden Markov Model, HMM) system (the DNN-HMM model).
[0175] In terms of CPU usage, even when the model parameter count of the system shown in this scheme is about 3 times that of the DNN-HMM system (2.1M vs. 0.7M), it still has a CPU usage similar to that of the DNN-HMM system model.
[0176] Speech recognition rate comparison:
[0177] Table 1 below compares the character error rate (CER) of the DNN-HMM system and the Transducer system proposed in this scheme.
[0178] Table 1
[0179]
Model        Parameters   Test set 1 CER (%)   Test set 2 CER (%)
DNN-HMM      0.7M         14.88                19.77
Transducer1  0.8M         12.1                 16.09
Transducer2  1.9M         9.76                 13.4
Transducer3  2.1M         8.93                 13.18
[0180] From Table 1 it can be seen that, at a comparable parameter count, the Transducer1 model achieves relative CER reductions of 18.7% and 18.6% over DNN-HMM on the two test sets, respectively. Meanwhile, when the model parameters are increased, Transducer3 reaches character error rates of 8.93% and 13.18%.
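The relative reductions quoted above follow directly from the Table 1 figures:

```python
def relative_cer_reduction(baseline_cer, system_cer):
    """Relative character-error-rate reduction, in percent."""
    return (baseline_cer - system_cer) / baseline_cer * 100

# Transducer1 (0.8M) vs. DNN-HMM (0.7M), per Table 1:
r1 = relative_cer_reduction(14.88, 12.1)   # test set 1 -> about 18.7
r2 = relative_cer_reduction(19.77, 16.09)  # test set 2 -> about 18.6
```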
[0181] CPU usage comparison:
[0182] Table 2
[0183]
Model        Parameters   CPU usage (peak)
DNN-HMM      0.7M         16%
Transducer1  0.8M         18%
Transducer2  1.9M         20%
Transducer3  2.1M         20%
[0184] Comparing Transducer1 and DNN-HMM at a comparable parameter count, the peak CPU usage of the Transducer1 model is 2% higher than that of the DNN-HMM model, but the peak usage of the Transducer models does not change significantly as the parameter count increases: CPU usage remains at a low level even as the model grows and the recognition error rate drops.
[0185] In summary, in the scheme shown in the present application, for phoneme recognition results containing the probability distribution of a speech frame over each phoneme and the empty output, the probability of the empty output is first suppressed before the phoneme recognition results are input into the decoding graph. This reduces the likelihood that a speech frame is recognized as an empty output, that is, it reduces the model's deletion errors, thereby improving the recognition accuracy of the model.
[0186] The embodiment shown in Figure 3 of the present application applies both the empty-output weight adjustment (step 303) and the decoding frame skipping (step 304) as an example. In other implementations, the empty-output weight adjustment and the decoding frame skipping can also be applied independently. For example, in an exemplary embodiment of the present application, the scheme can be as follows:
[0187] obtaining a voice signal, the speech signal including the speech frames obtained by dividing the original voice;
[0188] processing the voice signal through the acoustic model to obtain the phoneme recognition result corresponding to each speech frame; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains each phoneme and an empty output; the acoustic model is trained with speech signal samples and the actual phonemes of each speech frame in the speech signal samples;
[0189] among the phoneme recognition results corresponding to each speech frame, inputting those whose empty-output probability satisfies a specified condition into the decoding graph to obtain the recognized text sequence corresponding to the speech signal.
[0190] In summary, in the scheme shown in the present application, for phoneme recognition results containing the probability distribution of a speech frame over each phoneme and the empty output, only those whose empty-output probability satisfies the condition are input into the decoding graph. This reduces the number of phonemes that need to be decoded and skips unnecessary decoding steps, thereby effectively improving speech recognition efficiency.
[0191] Please refer to Figure 8, which is a framework diagram of a speech recognition system according to an exemplary embodiment. As shown in Figure 8, the audio acquisition device 81 is connected to the speech recognition device 82, and the speech recognition device 82 includes an acoustic model 82a, a probability adjustment unit 82b, a decoding graph input unit 82c, a decoding graph 82d, and a feature extraction unit 82e. The decoding graph 82d consists of a phoneme dictionary and a language model.
[0192] In application, the audio acquisition device 81 collects the user's original voice and transmits it to the feature extraction unit 82e in the speech recognition device 82, which divides the voice into speech frames and performs feature extraction on each of them. The speech features of a speech frame, together with the text recognized from the previous four non-empty speech frames of that frame, are input into the FSMN and the one-dimensional convolutional network in the acoustic model 82a, respectively, and the acoustic model 82a outputs the phoneme recognition result of the speech frame.
[0193] The phoneme recognition result is input into the probability adjustment unit 82b, which adjusts the probability of the empty output; the adjusted phoneme recognition result is then examined by the decoding graph input unit 82c. When the adjusted empty-output probability is less than the threshold, it is determined that decoding is needed, and the decoding graph input unit 82c inputs the adjusted phoneme recognition result into the decoding graph 82d, which decodes it into text; conversely, if the adjusted empty-output probability is not less than the threshold, it is determined that decoding is not needed, and the adjusted phoneme recognition result is discarded.
[0194] The decoding graph decodes the adjusted phoneme recognition results of each speech frame and outputs the text sequence to a natural language processing component, which responds to the voice input by the user.
[0195] Figure 9 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment. The speech recognition apparatus can implement all or some of the steps in the methods provided by the embodiments shown in Figure 2 or Figure 3. The speech recognition apparatus can include:
[0196] a voice signal processing module 901, configured to process the voice signal through the acoustic model to obtain the phoneme recognition result corresponding to each speech frame in the voice signal; the phoneme recognition result is used to indicate the probability distribution of the corresponding speech frame in the phoneme space; the phoneme space contains each phoneme and an empty output; the acoustic model is trained with speech signal samples and the actual phonemes of each speech frame in the speech signal samples;
[0197] a probability adjustment module 902, configured to perform suppression adjustment on the probability of the empty output in the phoneme recognition result of each speech frame, so as to reduce the ratio between the probability of the empty output and the probability of each phoneme in the phoneme recognition result;
[0198] a decoding module 903, configured to input the adjusted phoneme recognition results corresponding to each speech frame into the decoding graph to obtain the recognized text sequence corresponding to the speech signal.
[0199] In a possible implementation, the probability adjustment module 902 is configured to adjust the phoneme recognition results corresponding to each speech frame by at least one of the following adjustment modes:
[0200] reducing the probability of the empty output in the phoneme recognition result corresponding to each speech frame;
[0201] as well as,
[0202] improving the probability of each phoneme in the phoneme recognition result corresponding to each speech frame.
[0203] In a possible implementation, the probability adjustment module 902 is configured to multiply the probability of the empty output in the phoneme recognition result corresponding to each speech frame by the first weight, the first weight being less than 1 and greater than 0.
[0204] In a possible implementation, the probability adjustment module 902 is configured to multiply the probability of each phoneme in the phoneme recognition result corresponding to each speech frame by the second weight, the second weight being greater than 1.
[0205] In a possible implementation, the decoding module 903 is configured to:
[0206] in response to the probability of the empty output in a target phoneme recognition result satisfying the specified condition, input the target phoneme recognition result into the decoding graph to obtain the recognized text corresponding to the target phoneme recognition result;
[0207] wherein the target phoneme recognition result is any one of the phoneme recognition results corresponding to each speech frame.
[0208] In a possible implementation, the specified condition comprises:
[0209] the probability of the empty output in the target phoneme recognition result being less than the probability threshold.
[0210] In a possible implementation, the apparatus further comprises:
[0211] a parameter acquisition module, configured to obtain threshold influence parameters, the threshold influence parameters including at least one of ambient sound intensity, the number of speech recognition failures within a specified time period, and user setting information;
[0212] a threshold determination module, configured to determine the probability threshold based on the threshold influence parameters.
[0213] In a possible implementation, the speech signal processing module 901 is configured to:
[0214] perform feature extraction on a target speech frame to obtain a feature vector of the target speech frame; the target speech frame is any one of the speech frames;
[0215] input the feature vector of the target speech frame into the encoder in the acoustic model to obtain an acoustic hidden-layer representation vector of the target speech frame;
[0216] input the history recognized text of the target speech frame into the predictor in the acoustic model to obtain a text hidden-layer representation vector of the target speech frame; the history recognized text of the target speech frame is the text recognized by the decoding graph from the phoneme recognition results of the previous N non-empty outputs of the target speech frame; N is an integer greater than or equal to 1;
[0217] input the acoustic hidden-layer representation vector of the target speech frame and the text hidden-layer representation vector of the target speech frame into an integration network to obtain the phoneme recognition result of the target speech frame.
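The integration (joint) network step above can be sketched as follows; the tiny dimensions, weights, and tanh/softmax choices are placeholders for illustration, not the trained model's actual architecture:

```python
import math

def joint_network(h_enc, h_pred, W_e, W_p, b):
    """Combine the encoder's acoustic hidden vector and the predictor's
    text hidden vector through linear projections and a tanh, then
    softmax over the P phonemes plus the empty output."""
    z = [math.tanh(sum(w * x for w, x in zip(row_e, h_enc)) +
                   sum(w * x for w, x in zip(row_p, h_pred)) + bi)
         for row_e, row_p, bi in zip(W_e, W_p, b)]
    m = max(z)                             # stabilize the softmax
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]           # probability distribution
```

The returned distribution over P + 1 classes is the phoneme recognition result on which the empty-output weight adjustment and PSD of the earlier steps operate.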
[0218] In a possible implementation, the encoder is a feedforward sequential memory network (FSMN).
[0219] In a possible implementation, the predictor is a one-dimensional convolutional network.
[0220] In a possible implementation, the decoding graph is composed of a phoneme dictionary and a language model.
[0221] In summary, in the scheme shown in the present application, for phoneme recognition results containing the probability distribution of a speech frame over each phoneme and the empty output, the probability of the empty output is first suppressed before the phoneme recognition results are input into the decoding graph. This reduces the likelihood that a speech frame is recognized as an empty output, that is, it reduces the model's deletion errors, thereby improving the recognition accuracy of the model.
[0222] Figure 10 is a structural diagram of a computer device according to an exemplary embodiment. The computer device can be implemented as the computer device in the various method embodiments described above. The computer device 1000 includes a central processing unit 1001, a system memory 1004 comprising a random access memory (RAM) 1002 and a read-only memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The computer device 1000 also includes a basic input/output system 1006 that helps transmit information between the various devices within the computer, and a mass storage device 1007 for storing an operating system 1013, an application 1014, and other program modules 1015.
[0223] The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer readable medium provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 can include a computer readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
[0224] Without loss of generality, the computer readable medium can include a computer storage medium and a communication medium. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, flash memory or other solid-state storage, CD-ROM or other optical storage, tape cassettes, magnetic tape, disk storage, or other magnetic storage devices. Of course, those skilled in the art know that the computer storage medium is not limited to the above. The system memory 1004 and the mass storage device 1007 described above can be collectively referred to as a memory.
[0225] The computer device 1000 can be connected to the Internet or other network devices through a network interface unit 1011 connected to the system bus 1005.
[0226] The memory further includes at least one computer instruction stored in the memory, and the processor implements all or some of the steps of the method shown in Figure 2 or Figure 3 by loading and executing the at least one computer instruction.
[0227] In an exemplary embodiment, a non-transitory computer readable storage medium including instructions is also provided, such as a memory including a computer program (instructions) executable by the processor of the computer device to complete the methods shown in the various embodiments of this application. For example, the non-transitory computer readable storage medium can be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
[0228] In an exemplary embodiment, a computer program product or computer program is also provided, which includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes them, so that the computer device performs the methods shown in the various embodiments described above.
[0229] Other embodiments of the present application will be readily apparent to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptive changes of the present application; these variations, uses, or adaptive changes follow the general principles of the present application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are considered exemplary only, with the true scope and spirit of the present application indicated by the claims.
[0230] It should be understood that the present application is not limited to the exact structures described above and illustrated in the drawings, and various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.