A corpus training method and system

A corpus training method and system, applied in the field of corpus training. The technology addresses the high cost of data collection, the resulting high cost of speech corpora, and their limited use, and achieves a rich corpus, low cost, and accurate training results.

Pending Publication Date: 2019-01-15
XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD
5 Cites, 12 Cited by

AI Technical Summary

Problems solved by technology

The cost of data collection is too high, which makes speech corpora expensive and in turn limits their use.



Embodiment Construction

[0033] In order to make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

[0034] As shown in Figure 1, a corpus training method of the present invention includes the following steps:

[0035] a. Obtain a text corpus through a web crawler, and screen the text corpus by type;

[0036] b. Perform word segmentation and sentence segmentation on the screened text corpus;

[0037] c. Annotate the segmented text corpus with pinyin;

[0038] d. Input the segmented text corpus and its corresponding pinyin into the language model for training to obtain a classified corpus or a classif...
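
The screening, segmentation, and pinyin annotation in steps a to c can be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: the crawler output is replaced by a hard-coded list, the type labels ("chat", "ad") are invented, the lexicon is a hypothetical six-entry dictionary, and forward maximum matching is only one possible segmentation technique that the source does not name. The training in step d is not shown.

```python
import re

# Hypothetical toy lexicon (word -> pinyin with tone numbers).
# A real system would use a full dictionary or a dedicated library.
LEXICON = {
    "你好": "ni3 hao3",
    "世界": "shi4 jie4",
    "今天": "jin1 tian1",
    "天气": "tian1 qi4",
    "很": "hen3",
    "好": "hao3",
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def screen_by_type(documents, wanted_type):
    """Step a (screening part): keep texts labeled with the wanted type."""
    return [text for doc_type, text in documents if doc_type == wanted_type]


def split_sentences(text):
    """Step b: split on common sentence-ending punctuation."""
    parts = re.split(r"[。！？!?.；;]+", text)
    return [part.strip() for part in parts if part.strip()]


def segment(sentence):
    """Step b: forward maximum matching against the toy lexicon."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in LEXICON:
                words.append(candidate)
                i += size
                break
    return words


def annotate(words):
    """Step c: pair each word with its pinyin, or None when unknown."""
    return [(word, LEXICON.get(word)) for word in words]


def prepare_corpus(documents, wanted_type):
    """Run steps a to c and return pinyin-annotated sentences."""
    corpus = []
    for text in screen_by_type(documents, wanted_type):
        for sentence in split_sentences(text):
            corpus.append(annotate(segment(sentence)))
    return corpus


# Stand-in for crawler output: (type label, raw text) pairs.
crawled = [
    ("chat", "你好世界。今天天气很好！"),
    ("ad", "好好好"),
]
prepared = prepare_corpus(crawled, "chat")
for sentence in prepared:
    print(sentence)
```

Each element of `prepared` is one sentence as a list of (word, pinyin) pairs, which is the kind of paired input that step d describes feeding to the language model.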


Abstract

The invention discloses a corpus training method and system. A text corpus is obtained by a web crawler and screened by type. Word segmentation and sentence segmentation are performed on the screened text corpus. The segmented text corpus is annotated with pinyin. The text corpus and its corresponding pinyin are input into the language model for training to obtain a classified corpus pool or a classified corpus. There is no need to set up a dedicated recording studio or spend a lot of time recording a corpus; instead, the corpus is obtained directly through the web crawler and processed by word segmentation, sentence segmentation, and phonetic notation, so that the required corpus pool or corpus can be obtained at lower cost and with better versatility.

Description

Technical field

[0001] The invention relates to the field of artificial intelligence technology, in particular to a corpus training system and a corresponding method.

Background technique

[0002] Speech recognition is an application of artificial intelligence and machine learning. Machine learning tasks are generally divided into two processes, training and prediction: the training process generalizes from known samples to form a model, and the prediction process uses the model to make predictions about unknown samples. The predicted result therefore depends on the completeness and accuracy of the model. The machine learning task follows the Bayesian principle. The Bayesian formula is: P(h|D) = P(D|h) * P(h) / P(D), where D is the sample set, h is a hypothesis in the model's hypothesis space, and P(h|D) is the posterior probability, that is, the conditional probability of h given that D has been observed. The basic meaning of the Bayesian formula is to maximize ...
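
As a small worked illustration of the Bayesian formula, the sketch below computes P(h|D) for two hypotheses. The hypothesis names and all probability values are invented for the example, and P(D) is obtained from the law of total probability over the listed hypotheses. Choosing the hypothesis with the largest posterior is shown only as one common use of the formula.

```python
def posterior(priors, likelihoods):
    """Return P(h|D) for each hypothesis h, using P(D|h) * P(h) / P(D)."""
    # P(D) = sum over h of P(D|h) * P(h), assuming the hypotheses are
    # mutually exclusive and cover all cases.
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}


# Invented numbers: P(h) and P(D|h) for two hypothetical hypotheses.
priors = {"h1": 0.3, "h2": 0.7}
likelihoods = {"h1": 0.9, "h2": 0.2}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)  # hypothesis with the largest posterior
print(post, best)
```

Here h1 starts with the smaller prior but explains the data much better, so its posterior (0.27 / 0.41) ends up larger than that of h2 (0.14 / 0.41).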


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F16/35
CPC: G06F40/211, G06F40/284
Inventors: 刘翔鹏, 肖龙源, 李稀敏, 蔡振华, 刘晓葳, 谭玉坤
Owner: XIAMEN KUAISHANGTONG INFORMATION TECH CO LTD