Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm

A word segmentation method that combines the MMseg algorithm with pointwise mutual information, applied in computing, natural language analysis, special data processing applications, etc. It addresses the problems of slow word segmentation speed and low word segmentation efficiency, and improves segmentation accuracy and ambiguity resolution ability.

Inactive Publication Date: 2017-03-22
SUN YAT SEN UNIV
2 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

[0007] Analysis shows that traditional dictionary-based word segmentation methods rely heavily on professional dictionaries and are weak at recognizing unregistered (out-of-vocabulary) words, while statistics-based word segmentation methods require training on a large corpus in advance, have low segmentation efficiency, and are slow.



Examples


Embodiment 1

[0037] As shown in Figure 1, the method provided by the present invention uses the MMseg algorithm to segment the text based on a dictionary and, after obtaining the segmentation result, uses the pointwise mutual information algorithm to correct it.

[0038] The pointwise mutual information algorithm corrects the segmentation result as follows: calculate the pointwise mutual information of adjacent words x and y in the text, then judge whether it is greater than the set threshold; if so, segment word x and word y together as a single independent word.
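The correction step can be illustrated with a minimal Python sketch. It assumes the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated from word and adjacent-pair counts in a pre-segmented training corpus. The excerpt does not show the patent's exact formula, probability estimates, or threshold, so the function names, the toy corpus, and the threshold value of 3.0 are all illustrative assumptions.

```python
import math
from collections import Counter


def train_counts(segmented_sentences):
    """Count single words and adjacent word pairs in a pre-segmented corpus."""
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def pmi(x, y, unigrams, bigrams):
    """Return log2(p(x, y) / (p(x) * p(y))), or -inf if the pair was never seen."""
    if bigrams[(x, y)] == 0 or unigrams[x] == 0 or unigrams[y] == 0:
        return float("-inf")
    p_xy = bigrams[(x, y)] / sum(bigrams.values())
    p_x = unigrams[x] / sum(unigrams.values())
    p_y = unigrams[y] / sum(unigrams.values())
    return math.log2(p_xy / (p_x * p_y))


def correct_segmentation(words, unigrams, bigrams, threshold):
    """Merge adjacent words whose PMI exceeds the threshold (one left-to-right pass)."""
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and pmi(words[i], words[i + 1], unigrams, bigrams) > threshold:
            merged.append(words[i] + words[i + 1])  # treat x and y as one word
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy training corpus (hypothetical data, already segmented).
corpus = [
    ["苹", "果", "手机", "壳"],
    ["苹", "果", "手机", "膜"],
    ["新", "手机", "壳"],
    ["新", "款", "膜"],
]
unigrams, bigrams = train_counts(corpus)
result = correct_segmentation(["苹", "果", "手机", "壳"], unigrams, bigrams, threshold=3.0)
print(result)  # ['苹果', '手机', '壳']
```

In this toy corpus the pair (苹, 果) has a PMI of about 3.29 and is merged, while (手机, 壳) scores about 2.71 and stays split. In practice the threshold would need to be tuned on held-out data.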

[0039] The MMseg algorithm is a dictionary-based word segmentation algorithm that can be divided into two parts: a "matching algorithm" and "disambiguation" rules. The matching algorithm includes two word segmentation methods: simpl...
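The paragraph above is truncated in the source, so the patent's own description of the matching methods is not available here. As a rough illustration of dictionary-based matching in general, the following Python sketch implements greedy forward maximum matching. The function name, the `max_word_len` parameter, and the toy dictionary are hypothetical, and the sketch omits MMseg's disambiguation rules.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy dictionary matching: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary (hypothetical).
dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", dictionary))  # ['研究生', '命', '起源']
```

The greedy pass picks 研究生 and strands 命, although 研究 / 生命 / 起源 is the intended reading. This is the kind of ambiguity that disambiguation rules are meant to resolve.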

Embodiment 2

[0066] In this embodiment, an experiment is carried out using the method of Embodiment 1. Web crawler software was used to collect 5,000 product title descriptions from a shopping website as the experimental corpus, and meaningless symbols such as punctuation, underscores, and special symbols were filtered out. Of the 5,000 titles, 3,500 were used as training text and 1,500 as test text. The statistics are as follows:

[0067]

data set         number of sentences  word count  single word / multiple word
training corpus  3500                 54834       2541 / 34156
test corpus      1500                 24545       1154 / 14354
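The corpus preparation described in this embodiment could be sketched in Python as follows. The source does not say which symbols were filtered or how the 3,500 / 1,500 split was drawn, so the regular expression, the order-based split, and the function names `clean_title` and `split_corpus` are assumptions made for illustration only.

```python
import re

# Assumption: "meaningless symbols" means anything that is not a letter,
# digit, CJK character, or whitespace, plus the underscore.
_SYMBOLS = re.compile(r"[^\w\s]|_")


def clean_title(title):
    """Strip punctuation, underscores, and special symbols from one title."""
    return _SYMBOLS.sub("", title).strip()


def split_corpus(titles, n_train):
    """Split titles into training and test parts by position (assumed, not from the source)."""
    return titles[:n_train], titles[n_train:]


print(clean_title("【新款】苹果_手机壳!!"))  # 新款苹果手机壳
```

With 5,000 cleaned titles, `split_corpus(titles, 3500)` would yield the 3,500 / 1,500 partition reported above.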



Abstract

The invention relates to a word segmentation method based on an MMseg algorithm and a pointwise mutual information algorithm. A text is subjected to word segmentation processing by the MMseg algorithm based on a dictionary, and a word segmentation result is corrected by the pointwise mutual information algorithm after the word segmentation result is obtained. A specific process of correcting the word segmentation result by the pointwise mutual information algorithm comprises the following steps: calculating pointwise mutual information of a character x and a character y which are adjacent to each other in the text; judging whether the pointwise mutual information of the character x and the character y is larger than a set threshold value or not; and if so, segmenting the character x and the character y as an independent word.

Description

Technical field

[0001] The present invention relates to the field of Chinese word segmentation, and more specifically to a word segmentation method based on the MMseg algorithm and the pointwise mutual information algorithm.

Background

[0002] China's research on natural language processing started relatively late, and the country only established its own natural language processing models in the 1980s. Later, with the development of computers and growing user needs, domestic emphasis on natural language processing increased greatly; the number of research institutions increased and research teams grew. Combining the characteristics of Chinese text with foreign achievements, researchers proposed new theoretical models to raise the level of research on Chinese understanding.

[0003] In English, words are separated by spaces, but in Chinese text, the characters within a sentence are connected togethe...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
CPC: G06F40/20, G06F40/284
Inventor: 谭军, 张凯华
Owner: SUN YAT SEN UNIV