Balance method of actual scene linguistic data and finite state network linguistic data

A finite state network and corpus technology, applied in the computer field, that addresses the problems of fixed syntactic forms, incomplete vocabulary, and corpora that are not close to practical use, and achieves more comprehensive vocabulary coverage.

Status: Inactive
Publication Date: 2009-12-02
INST OF AUTOMATION CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

However, the "Wizard of Oz" experiment is not a real application scenario, so the corpus obtained this way is not close to practical use. In addition, when the original training corpus is obtained from the web, the sheer volume of web data and the limitations of search engines mean that the extracted corpus (even at as much as 30 GB) may not cover the words and syntax of the limited domain.
In addition, some simple dialogue systems train the language model on a corpus generated from FSN syntax rules and have achieved good results. However, a corpus generated this way has fixed syntax and an incomplete vocabulary, so it is only suitable for simple dialogue applications with fixed forms.


Examples


Embodiment Construction

[0032] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

[0033] The object of the present invention is achieved as follows:

[0034] 1. Generate FSN corpus

[0035] FSN is a commonly used representation of grammatical structure. It was originally used mainly as a search network in rule-based speech recognition systems. The present invention uses the FSN concept to design syntax rules and uses programs to generate a corpus from them, which is then used to train n-gram language models. The n-gram language model is a type of statistical language model. Statistical language models usually use the chain rule to estimate the probability of a sentence:

[0036] P ( s ) = P ( ...
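The equation above is truncated in this copy, so the following is only a minimal sketch of the general idea rather than the patent's own formulation. In its standard form the chain rule factorizes a sentence probability into a product of P(w_i | w_1 ... w_{i-1}), and an n-gram model shortens each history to the previous n-1 words. The sketch enumerates every path through a small slot-based FSN to produce a corpus, then estimates unsmoothed bigram probabilities from it. The grammar `FSN_SLOTS`, the function names, and the choice of bigrams without smoothing are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy FSN: each slot lists the alternatives allowed at that position.
FSN_SLOTS = [
    ["where is", "how do i reach"],
    ["the"],
    ["ticket office", "exit"],
]

def generate_fsn_corpus(slots):
    """Enumerate every path through the slot network as one sentence."""
    return [" ".join(path) for path in product(*slots)]

def train_bigram(sentences):
    """Count bigrams and their histories, with sentence boundary markers."""
    bigrams, histories = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            bigrams[(prev, cur)] += 1
            histories[prev] += 1
    return bigrams, histories

def sentence_prob(sentence, bigrams, histories):
    """Chain rule with a bigram history, using unsmoothed relative frequencies."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if histories[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= bigrams[(prev, cur)] / histories[prev]
    return prob

corpus = generate_fsn_corpus(FSN_SLOTS)
bigrams, histories = train_bigram(corpus)
in_grammar = sentence_prob("where is the exit", bigrams, histories)
out_of_grammar = sentence_prob("where is the door", bigrams, histories)
print(len(corpus), in_grammar, out_of_grammar)
```

A sentence that uses a word outside the grammar receives zero probability here, which is one way to see the fixed-syntax and incomplete-vocabulary limitation that motivates adding an actual scene corpus.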


Abstract

The invention provides a method for balancing actual scene linguistic data with finite state network (FSN) linguistic data. To train the language model of a continuous speech recognizer, training linguistic data is produced according to the application field of the recognizer. This data comes from two main sources: actual scene linguistic data, obtained by collating records made in the actual scene, and FSN linguistic data, generated from finite state network syntactic rules. The invention studies how to balance the two. It takes the comparison of the probabilities of keywords shared by the actual scene linguistic data and the FSN linguistic data as its basis, and uses a certain multiple of the actual scene linguistic data to expand the FSN linguistic data, yielding the final linguistic data for language model training. A language model trained on linguistic data obtained with this method greatly improves the recognition performance of the continuous speech recognizer.
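The mechanics described in the abstract can be sketched roughly as follows. This excerpt does not state how the keyword probability comparison is turned into a multiple, so the sketch leaves `multiple` as a caller-supplied parameter and only shows the two ingredients: comparing the relative frequency of shared keywords in each corpus, and appending repeated copies of the actual scene corpus to the FSN corpus. The sample sentences, the keyword list, and all function names are hypothetical.

```python
from collections import Counter

def keyword_probs(sentences, keywords):
    """Relative frequency of each keyword among all word tokens in a corpus."""
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    return {k: counts[k] / total for k in keywords}

def shared_keywords(scene, fsn, keywords):
    """Keep only the keywords that occur in both corpora."""
    scene_vocab = {word for s in scene for word in s.split()}
    fsn_vocab = {word for s in fsn for word in s.split()}
    return sorted(set(keywords) & scene_vocab & fsn_vocab)

def expand(fsn, scene, multiple):
    """FSN corpus plus `multiple` copies of the actual scene corpus.

    Assumption: "a certain multiple" is read here as whole-corpus repetition.
    """
    return list(fsn) + list(scene) * multiple

# Hypothetical data for illustration only.
scene = ["please tell me where the exit is", "is the exit open now"]
fsn = [
    "where is the exit",
    "where is the ticket office",
    "how do i reach the exit",
    "how do i reach the ticket office",
]
keywords = ["exit", "ticket", "platform"]

shared = shared_keywords(scene, fsn, keywords)
p_scene = keyword_probs(scene, shared)
p_fsn = keyword_probs(fsn, shared)
merged = expand(fsn, scene, multiple=3)  # the multiple is chosen by the caller
print(shared, p_scene, p_fsn, len(merged))
```

Comparing `p_scene` and `p_fsn` for the shared keywords gives the kind of evidence the abstract refers to; how that comparison selects the multiple is left open here because the excerpt does not specify it.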

Description

Technical field

[0001] The invention belongs to the field of computer technology and relates to the language model of a continuous speech recognizer, in particular to the problem of producing training corpus for the language model of a limited-domain voice question answering system, and more specifically to a method for balancing an actual scene corpus with a Finite State Network (FSN) corpus designed for that scene.

Background technique

[0002] One of the main difficulties in training limited-domain language models is the sparsity of training data. Research on this problem focuses mainly on two aspects: expansion of the corpus, and corpus smoothing algorithms. Smoothing does not solve data sparsity at its root; it can only mitigate, to a certain extent, the problems that sparsity causes, and some algorithms have shortcomings of their own, such as the Good-Turing discounting algorithm...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L15/06
Inventor: 李成荣, 熊军军
Owner: INST OF AUTOMATION CHINESE ACAD OF SCI