Character and word hybrid language model-based Chinese speech keyword retrieval method

A hybrid character and word language model method, applied to Chinese speech keyword retrieval. It addresses the problem that the gap between the language model scores of common words and uncommon words is amplified during decoding, so that correct paths containing uncommon words are easily pruned and retrieval performance suffers.

Active Publication Date: 2017-01-04
INST OF ACOUSTICS CHINESE ACAD OF SCI +1

AI Technical Summary

Problems solved by technology

Although any Chinese word can be formed by concatenating single characters, two factors together create and amplify the gap between the language model scores of common words and uncommon words: the sparsity of the language model training data, and the language model scale factor (LM scale) applied during decoding to balance the acoustic model score against the language model score. As a result, the correct path containing an uncommon word is easily pruned during decoding, which degrades retrieval performance.
If the recognition system instead uses a recognition dictionary based on subword units such as syllables or phonemes, the uncommon-word problem can be avoided, but at the cost of some loss in retrieval performance for common words.
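The pruning effect described above can be illustrated with a small numeric sketch. It assumes the usual log-domain combination `acoustic + lm_scale * lm` and a simple beam threshold; the probabilities, acoustic scores, beam width, and function names are all hypothetical values chosen for illustration, not figures from the patent.

```python
import math

def combined_score(acoustic_logp: float, lm_logp: float, lm_scale: float) -> float:
    """Log-domain path score: acoustic score plus the scaled LM score."""
    return acoustic_logp + lm_scale * lm_logp

def survives_beam(score: float, best_score: float, beam: float) -> bool:
    """A path is kept only if it lies within `beam` of the best path."""
    return score >= best_score - beam

# Hypothetical values: the uncommon word is the correct one, so its
# acoustic score is slightly better, but its LM probability is much lower.
COMMON_LM = math.log(1e-3)
RARE_LM = math.log(1e-6)
ACOUSTIC_COMMON = -50.0
ACOUSTIC_RARE = -45.0
BEAM = 30.0

def rare_path_kept(lm_scale: float) -> bool:
    common = combined_score(ACOUSTIC_COMMON, COMMON_LM, lm_scale)
    rare = combined_score(ACOUSTIC_RARE, RARE_LM, lm_scale)
    return survives_beam(rare, max(common, rare), BEAM)

def score_gap(lm_scale: float) -> float:
    """How far the uncommon-word path trails the common-word path."""
    common = combined_score(ACOUSTIC_COMMON, COMMON_LM, lm_scale)
    rare = combined_score(ACOUSTIC_RARE, RARE_LM, lm_scale)
    return common - rare

for scale in (1.0, 12.0):
    print(scale, round(score_gap(scale), 2), rare_path_kept(scale))
```

With these made-up numbers the uncommon-word path stays inside the beam at a scale of 1 but falls outside it at a scale of 12, because the LM score difference is multiplied by the scale factor while the acoustic advantage is not.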

Examples

Embodiment Construction

[0027] The present invention will be further described below.

[0028] The method provided by the present invention first distinguishes uncommon words from common words in the training corpus according to part of speech, and obtains statistics on how uncommon words occur. It then adds an uncommon-word node to the decoding network and connects that node to a sub-network formed from all single characters. During decoding, the word language model determines whether to enter the uncommon-word node; after entering, the character language model limits the search range. This alleviates the pruning of the correct decoding path caused by the sparseness of the word language model, and thereby improves the retrieval performance for uncommon words. The specific description is as follows:
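One way to picture the two-level scoring described above is the following minimal sketch. It is not the patent's decoder: it assumes a unigram word model that contains a placeholder entry `<RARE>` for the uncommon-word node and an unsmoothed character bigram model for the sub-network, and every name and probability in it is hypothetical.

```python
import math

RARE = "<RARE>"  # hypothetical symbol standing for the uncommon-word node

def char_lm_logp(chars: str, char_bigram: dict) -> float:
    """Score a character sequence with a bigram character model (no smoothing)."""
    seq = ["<s>", *chars, "</s>"]
    return sum(math.log(char_bigram[(a, b)]) for a, b in zip(seq, seq[1:]))

def hybrid_logp(word: str, word_unigram: dict, char_bigram: dict) -> float:
    """Common words are scored by the word model alone. Any other word pays
    the word-model cost of entering the uncommon-word node, then the
    character-model cost of spelling itself out inside that node."""
    if word != RARE and word in word_unigram:
        return math.log(word_unigram[word])
    return math.log(word_unigram[RARE]) + char_lm_logp(word, char_bigram)

# Toy models with made-up probabilities.
word_unigram = {"我": 0.4, "在": 0.3, RARE: 0.3}
char_bigram = {
    ("<s>", "颜"): 0.5,
    ("颜", "永"): 0.5,
    ("永", "红"): 0.5,
    ("红", "</s>"): 0.5,
}

common_score = hybrid_logp("在", word_unigram, char_bigram)
rare_score = hybrid_logp("颜永红", word_unigram, char_bigram)
```

In this sketch the word model only has to predict that some uncommon word occurs, which is a far less sparse event than predicting one specific name, and the character model then constrains which character strings are plausible inside the node.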

[0029] (1) As shown in Figure 1, use a part-of-speech tagging tool to process the training set, convert the words tagged as person names, place names, an...

Abstract

The invention proposes a Chinese speech keyword retrieval method and system based on a hybrid character and word language model. The method comprises two steps. Step 101): distinguish uncommon words from common words in a training corpus according to part of speech, add mark information to the characters forming the uncommon words, and partition the original training corpus into a new corpus composed of the common words and symbols marked with uncommon-word information; build a word language model from the new corpus, and train a separate language model on individual characters from the original training corpus to obtain a character language model. Step 102): establish a primary decoding network and a secondary decoding network, and perform keyword retrieval based on both. During decoding, the word language model determines whether a node marked with uncommon-word information is entered; that node is connected to the secondary decoding network composed of all the individual characters, and the character language model limits the search range once the secondary decoding network has been entered.
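The corpus preparation in step 101) might look roughly like the sketch below. It is a simplification under stated assumptions: the part-of-speech tags `nr` and `ns` for person and place names, the single placeholder symbol `<RARE>`, and the function names are all hypothetical, and the abstract's per-character mark information is collapsed into that one placeholder.

```python
# Assumed tag set marking uncommon words (person names, place names).
RARE_POS = {"nr", "ns"}
RARE = "<RARE>"  # hypothetical symbol carrying the uncommon-word mark

def to_word_corpus(tagged_sentence):
    """New corpus for the word LM: common words plus the marked symbol."""
    return [RARE if pos in RARE_POS else word for word, pos in tagged_sentence]

def to_char_corpus(tagged_sentence):
    """Original corpus split into single characters, for the character LM."""
    return [ch for word, _ in tagged_sentence for ch in word]

# One sentence as (word, part-of-speech) pairs, as a tagger might emit it.
sentence = [("张鹏远", "nr"), ("在", "p"), ("北京", "ns"), ("工作", "v")]

word_corpus = to_word_corpus(sentence)
char_corpus = to_char_corpus(sentence)
print(word_corpus)  # ['<RARE>', '在', '<RARE>', '工作']
print(char_corpus)  # ['张', '鹏', '远', '在', '北', '京', '工', '作']
```

The word language model would then be trained on sequences like `word_corpus` and the character language model on sequences like `char_corpus`, with any standard n-gram toolkit.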

Description

Technical field

[0001] The invention belongs to the field of speech recognition, and in particular relates to a Chinese speech keyword retrieval method based on a hybrid language model of characters and words, which can be used in speech keyword retrieval to improve the retrieval performance for uncommon words.

Background technique

[0002] In a speech keyword retrieval system where no speech template is provided, two methods are commonly used. One is acoustic keyword detection, which connects a decoding network composed of the keywords in parallel with a garbage phoneme (filler) network; its disadvantage is that the decoding network changes whenever the keyword list changes. The other is based on large-vocabulary continuous speech recognition, which is currently the most popular method. We define those words that are not in the recognition dictionary and do not appear in the training set ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G10L15/18
Inventors: 张鹏远, 王旭阳, 潘接林, 颜永红
Owner: INST OF ACOUSTICS CHINESE ACAD OF SCI