Generation method and device of word segmentation training set

A technology for generating word segmentation training sets, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problem that manually labeling data is expensive in both money and time.

Active Publication Date: 2015-08-26
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD
Cites: 5. Cited by: 22.

AI Technical Summary

Problems solved by technology

Labeling data manually is very expensive in both money and time.

Embodiment Construction

[0019] Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals denote the same or similar modules or modules having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary only for explaining the present invention and should not be construed as limiting the present invention. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.

[0020] Figure 1 is a schematic flow chart of a method for generating a word segmentation training set according to an embodiment of the present invention. The method includes:

[0021] S11: Obtain a training corpus, segment the same training corpus with each of several different tokenizers, and obtain the word segmentation result corresponding to each tokenizer....
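Step S11 can be sketched as follows. The text does not name the tokenizers it uses, so this minimal sketch assumes two toy dictionary-based tokenizers (forward and backward maximum matching) over a hypothetical lexicon `TOY_DICT`. The helper `segment_with_all` is also an illustrative name, not something taken from the patent.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon (assumption); a real tokenizer uses a far larger one.
TOY_DICT = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = max(len(w) for w in TOY_DICT)


def forward_max_match(text: str) -> List[str]:
    """Greedy longest-match segmentation, scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in TOY_DICT:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str) -> List[str]:
    """Greedy longest-match segmentation, scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_WORD_LEN, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in TOY_DICT:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def segment_with_all(
    corpus: List[str],
    tokenizers: Dict[str, Callable[[str], List[str]]],
) -> Dict[str, List[List[str]]]:
    """Run every tokenizer over the same corpus (step S11)."""
    return {name: [tok(s) for s in corpus] for name, tok in tokenizers.items()}


corpus = ["研究生命的起源", "生命的起源"]
results = segment_with_all(
    corpus,
    {"forward": forward_max_match, "backward": backward_max_match},
)
```

On this toy corpus the two tokenizers disagree on the first sentence and agree on the second, which is the kind of difference the later steps of the method work with.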

Abstract

The invention provides a method and device for generating a word segmentation training set. The method comprises the following steps: obtaining a training corpus and segmenting the same training corpus with each of several different tokenizers, to obtain the word segmentation result corresponding to each tokenizer; dividing the word segmentation results into exactly matching results and non-exactly matching results; and, according to the word segmentation results, applying noise reduction to the non-exactly matching results to obtain the word segmentation training set. The method reduces the time and cost of generating a word segmentation training set, is inexpensive to implement, and improves the result.
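The division step in the abstract can be illustrated with a short sketch. It assumes that a sentence counts as an exact match when every tokenizer produces an identical segmentation for it; the excerpt does not define the matching criterion or its granularity, so that rule, the function name `split_by_agreement`, and the sample data are all assumptions. The noise reduction step is not shown because the excerpt does not describe how it is done.

```python
from typing import Dict, List, Tuple

Segmentation = List[str]


def split_by_agreement(
    results: Dict[str, List[Segmentation]],
) -> Tuple[List[Segmentation], List[Dict[str, Segmentation]]]:
    """Split per-sentence results into exact and non-exact matches.

    `results` maps a tokenizer name to its segmentations, one per sentence,
    all in the same sentence order.
    """
    names = list(results)
    if not names:
        return [], []
    n = len(results[names[0]])
    if any(len(results[name]) != n for name in names):
        raise ValueError("every tokenizer must segment the same sentences")

    exact: List[Segmentation] = []
    non_exact: List[Dict[str, Segmentation]] = []
    for idx in range(n):
        candidates = {name: results[name][idx] for name in names}
        first = candidates[names[0]]
        # Assumed rule: exact match means all tokenizers agree completely.
        if all(seg == first for seg in candidates.values()):
            exact.append(first)
        else:
            # Keep every candidate so a later step can choose among them.
            non_exact.append(candidates)
    return exact, non_exact


# Hypothetical output of two tokenizers on the same two sentences.
sample = {
    "tokenizer_a": [["研究生", "命", "的", "起源"], ["生命", "的", "起源"]],
    "tokenizer_b": [["研究", "生命", "的", "起源"], ["生命", "的", "起源"]],
}
exact, non_exact = split_by_agreement(sample)
```

Keeping all candidate segmentations for the disagreeing sentences, rather than discarding them, leaves the choice of noise reduction strategy open.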

Description

Technical Field

[0001] The invention relates to the technical field of speech processing, in particular to a method and device for generating a word segmentation training set.

Background Art

[0002] Speech synthesis, also known as Text to Speech (TTS), converts text into speech and reads it out in real time, which is equivalent to installing an artificial mouth on a machine. A speech synthesis system must first process the input text, and this processing includes word segmentation. There are two main types of word segmentation algorithms: algorithms based on dictionary matching, and learning algorithms based on a training corpus.

[0003] In the prior art, the conditional random field (CRF) model is a mainstream learning algorithm based on a training corpus. However, the CRF model is a supervised machine learning algorithm that requires a large amount of manually labeled data as support. The work of using human lab...
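To make concrete what labeled data for a supervised sequence model such as a CRF can look like, here is a minimal sketch that converts a segmented sentence into per-character tags. It assumes the widely used B/M/E/S scheme (begin, middle, end, single-character word); the excerpt does not say which tag scheme the invention uses, and `to_bmes` is an illustrative name.

```python
from typing import List, Tuple


def to_bmes(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a list of words into (character, tag) pairs.

    Tags (assumed scheme): B = first character of a multi-character word,
    M = interior character, E = last character, S = single-character word.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


tagged = to_bmes(["研究生", "命", "的", "起源"])
```

Every sentence in a training set needs one such tag per character, which is why producing the labels by hand is costly.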

Application Information

IPC(8): G06F17/27
Inventors: 白洁, 李秀林, 肖朔
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD