Word segmentation method and device

A word segmentation method and device, applied in the field of information processing, that address the need to estimate probabilities for new words in an n-gram language model and allow new words to be added flexibly.

Active Publication Date: 2014-03-26
AISPEECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] If you want to add new words to the n-gram language model, you need to estimate the probability for the new words, wh...



Examples


Embodiment Construction

[0037] To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings.

[0038] Figure 1 is a flow chart of the word segmentation method provided by an embodiment of the present invention. Referring to Figure 1, the embodiment includes:

[0039] 101. The text to be segmented is segmented using an n-gram model to obtain a first text. The n-gram model is used to resolve word segmentation ambiguity, and the first text consists of word strings separated by spaces;

[0040] In the embodiment of the present invention, the n-gram model approximates the occurrence of characters in a language as an (n-1)-th order Markov model. That is, given a Chinese character string c1, c2, ..., ci, only the preceding n-1 characters in the context affect the probability of the next character, that is, the a...
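The Markov approximation described in [0040] can be illustrated with a minimal sketch. It is not the patent's implementation: the function names, the `^` start-padding symbol, and the add-alpha smoothing are assumptions introduced here for illustration only. The sketch counts character n-grams in a small corpus and scores a string by multiplying, for each character, its probability conditioned on only the preceding n-1 characters.

```python
import math
from collections import Counter


def train_char_ngram(corpus, n=2):
    """Count character n-grams and their (n-1)-character contexts.

    `corpus` is a list of strings. The "^" padding symbol is a
    hypothetical start marker, not something taken from the patent.
    """
    ngrams, contexts = Counter(), Counter()
    for text in corpus:
        padded = "^" * (n - 1) + text
        for i in range(n - 1, len(padded)):
            context = padded[i - n + 1:i]  # the preceding n-1 characters
            ngrams[context + padded[i]] += 1
            contexts[context] += 1
    return ngrams, contexts


def log_prob(text, ngrams, contexts, vocab_size, n=2, alpha=1.0):
    """Log-probability of `text` under the (n-1)-th order Markov assumption.

    Each character depends only on the n-1 characters before it.
    Add-alpha smoothing is an assumption made here so that unseen
    n-grams do not receive zero probability.
    """
    padded = "^" * (n - 1) + text
    total = 0.0
    for i in range(n - 1, len(padded)):
        context = padded[i - n + 1:i]
        numerator = ngrams[context + padded[i]] + alpha
        denominator = contexts[context] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total


# Toy corpus; the same code works unchanged on Chinese character strings.
corpus = ["abab", "abba"]
ngrams, contexts = train_char_ngram(corpus, n=2)
vocab_size = len(set("".join(corpus)))
```

With this toy corpus, a string whose character pairs were seen often (such as `"ab"`) scores higher than one containing an unseen pair (such as `"aa"`), which is the property an n-gram model relies on when choosing between competing segmentations.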


Abstract

The invention discloses a word segmentation method and device, belonging to the field of information processing. The method comprises: segmenting a text to be segmented using an n-gram model to obtain a first text; when the first text includes a target word string that is not stored in a dictionary, adding the target word string to the dictionary to obtain an updated dictionary, the dictionary storing all word strings and their corresponding estimated probabilities; segmenting the first text according to the updated dictionary and a preset algorithm, using forward maximum matching and backward maximum matching respectively, to obtain a second text and a third text; and selecting, from the second and third texts, the text whose word-length expectation and word-length variance conform to preset rules as the word segmentation result. Because the existing dictionary is updated simply by adding new words to it, new words can be added flexibly without increasing word segmentation ambiguity.
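The dictionary-update and matching steps in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented procedure: the abstract does not specify the "preset algorithm" or the "preset rules", so the selection rule used here (prefer the larger mean word length, then the smaller variance) is an assumption, as are the function names, the `max_len` parameter, and the placeholder probability assigned to a newly added word.

```python
from statistics import mean, pvariance


def add_new_word(dictionary, word, default_prob=1e-6):
    """Add `word` to the dictionary if absent.

    `default_prob` is a hypothetical placeholder; the excerpt does not
    say which value a newly added word receives.
    """
    dictionary.setdefault(word, default_prob)


def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def select_segmentation(candidates):
    """Pick a candidate by word-length statistics.

    Assumed rule (not from the patent): larger mean word length wins,
    and ties are broken by smaller word-length variance.
    """
    def score(words):
        lengths = [len(w) for w in words]
        return (mean(lengths), -pvariance(lengths))
    return max(candidates, key=score)


# Toy dictionary mapping words to placeholder probabilities.
dictionary = {"研究": 0.01, "研究生": 0.005, "生命": 0.01}
add_new_word(dictionary, "起源")  # a new word that was not yet stored

text = "研究生命起源"
second = forward_max_match(text, dictionary, max_len=3)
third = backward_max_match(text, dictionary, max_len=3)
result = select_segmentation([second, third])
```

In this toy case the forward pass yields 研究生 / 命 / 起源 and the backward pass yields 研究 / 生命 / 起源. Both have a mean word length of 2, so the assumed tie-break on variance selects the backward result. How the existing space-separated boundaries of the first text are used during matching is not stated in the excerpt, so the sketch matches against the unsegmented string.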

Description

Technical field

[0001] The invention relates to the field of information processing, in particular to a word segmentation method and device.

Background technique

[0002] Chinese word segmentation refers to dividing a sequence of Chinese characters into individual words. It plays an important role in information retrieval, machine translation, and speech recognition, and is an indispensable step in Chinese speech processing. Because of word segmentation ambiguity, the accuracy of traditional dictionary-based mechanical word segmentation methods cannot reach 100%. For example, "南京市长江大桥" (Nanjing Yangtze River Bridge) can be segmented as "南京市 / 长江大桥" (Nanjing City / Yangtze River Bridge) or as "南京 / 市长 / 江大桥" (Nanjing / mayor / Jiang Daqiao). Without additional knowledge, both segmentations appear reasonable.

[0003] In order to solve the above problem of word segmentation ambiguity, in the prior art, an n-...


Application Information

IPC(8): G06F17/27, G06F17/30
Inventors: 王欢良, 薛峰, 惠寅华, 赵鹏程, 俞凯
Owner: AISPEECH CO LTD