Generation method and device for word segmentation dictionary and word segmentation processing method and device

A word segmentation dictionary and word segmentation technology, applied in the fields of electronic digital data processing, special data processing applications, instruments, etc. It addresses the long update cycle, the dependence on existing word segmentation devices, and the inability to recognize unregistered entries or corpus, and it improves the generation speed and quality of the word segmentation dictionary.

Active Publication Date: 2015-09-09
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

[0003] In the existing technology, both dictionary generation and corpus generation rely on manual screening and on segmentation by an existing tokenizer. As a result, the update cycle is long, the dependence on the existing tokenizer is heavy, and unregistered entries or corpus cannot be recognized.


Examples

Detailed Description of the Embodiments

[0029] Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals denote the same or similar modules or modules having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary only for explaining the present invention and should not be construed as limiting the present invention. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents coming within the spirit and scope of the appended claims.

[0030] Figure 1 is a schematic flow chart of a method for generating a word segmentation dictionary according to an embodiment of the present invention. The method comprises:

[0031] S11: Obtain the original sentence corpus.

[0032] Existing data can be collected to obtain the original sentence corpus; for example, the original sentence corpus is the sentenc...


Abstract

The invention provides a method and device for generating a word segmentation dictionary and a method and device for word segmentation processing. The method for generating the word segmentation dictionary comprises the following steps: acquiring an original sentence corpus; segmenting the original sentence corpus to obtain fragments, and filtering the fragments to obtain a filter result, wherein the filtering comprises at least one of the following: filtering based on word frequency and inverse frequency; filtering based on boundaries; and filtering based on splicing; and generating the word segmentation dictionary according to the filter result. The method depends on neither manual screening nor an existing word segmentation device and can recognize unregistered entries, so the generation speed and quality of the word segmentation dictionary are improved.
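The steps in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the patented procedure: the excerpt does not say how fragments are produced or how each filter is computed. Here fragments are taken to be all substrings of 2 to `max_len` characters, the frequency filter is reduced to a plain count threshold (the inverse-frequency part is omitted), the boundary filter is approximated by the entropy of the neighbouring characters, and the splicing filter is left out. The names `candidate_fragments`, `entropy`, `build_dictionary`, `min_freq`, and `min_boundary_entropy` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def candidate_fragments(sentence, max_len=4):
    """Yield (fragment, left_char, right_char) for substrings of length 2..max_len."""
    n = len(sentence)
    for i in range(n):
        for j in range(i + 2, min(i + max_len, n) + 1):
            left = sentence[i - 1] if i > 0 else "<s>"   # sentence-start marker
            right = sentence[j] if j < n else "</s>"     # sentence-end marker
            yield sentence[i:j], left, right


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def build_dictionary(corpus, min_freq=2, min_boundary_entropy=0.5, max_len=4):
    """Collect fragments from raw sentences and keep those passing both filters."""
    freq = Counter()
    left_ctx = defaultdict(Counter)
    right_ctx = defaultdict(Counter)
    for sentence in corpus:
        for frag, left, right in candidate_fragments(sentence, max_len):
            freq[frag] += 1
            left_ctx[frag][left] += 1
            right_ctx[frag][right] += 1

    dictionary = set()
    for frag, count in freq.items():
        # Assumed frequency filter: drop rare fragments.
        if count < min_freq:
            continue
        # Assumed boundary filter: a real word should appear next to varied
        # characters on both sides, so require enough neighbour entropy.
        if min(entropy(left_ctx[frag]), entropy(right_ctx[frag])) < min_boundary_entropy:
            continue
        dictionary.add(frag)
    return dictionary
```

With the toy corpus `["xabq", "yabr", "zabs"]`, only `"ab"` occurs more than once and has varied neighbours on both sides, so it is the only fragment kept. No existing tokenizer or hand-built word list is consulted at any point, which is the property the abstract emphasizes.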

Description

Technical field

[0001] The invention relates to the technical field of speech processing, and in particular to a method and device for generating a word segmentation dictionary and a method and device for word segmentation processing.

Background

[0002] Speech synthesis, also known as text to speech (Text to Speech), converts text into speech and reads it out in real time, which is equivalent to giving a machine an artificial mouth. A speech synthesis system must first process the input text, and this processing includes word segmentation. There are two main types of word segmentation algorithm: matching algorithms based on a dictionary, and learning algorithms based on a training corpus. The dictionary and the training corpus are the data that these two types of algorithm respectively require.

[0003] In the existing technology, both dictionary generation ...
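To make the dictionary-matching family of algorithms concrete, the following is a minimal sketch of forward maximum matching, a common dictionary-based segmentation technique. The excerpt does not say which matching algorithm the invention uses, so this choice is an assumption; the function name `forward_max_match` and the parameter `max_word_len` are hypothetical.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text greedily, taking the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # A single character is always accepted so the loop makes progress
            # even when nothing in the dictionary matches.
            if length == 1 or piece in dictionary:
                words.append(piece)
                i += length
                break
    return words
```

For example, `forward_max_match("语音合成系统", {"语音", "合成", "语音合成", "系统"})` returns `["语音合成", "系统"]`. An entry missing from the dictionary falls apart into single characters, which is why the coverage of the dictionary matters for this kind of algorithm.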


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
Inventors: 肖朔, 李秀林, 白洁
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD