Dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in forward direction

A Chinese word segmentation and maximum matching technology, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as incorrect segmentation, slow segmentation, and the difficulty of choosing a maximum matching word length, with the effect of shortening word segmentation time and improving word segmentation accuracy.

Active Publication Date: 2015-12-09
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0006] The present invention provides a dictionary-based forward maximum matching Chinese word segmentation method that successively adds one character, to solve the slow segmentation speed and inaccurate segmentation results of the traditional forward maximum matching method. The method sets the maximum matching word length so as to avoid both failure modes of the traditional maximum matching method: when the length is set too long, many useless matches are performed and segmentation is slow; when it is set too short, longer words cannot be segmented correctly.
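
To make the trade-off concrete, here is a minimal Python sketch of the traditional forward maximum matching baseline that the paragraph above criticizes. It is an illustration only, not code from the patent: the function name `forward_max_match`, the probe counter, and the toy dictionaries are assumptions introduced here.

```python
def forward_max_match(text, dictionary, max_len):
    """Traditional forward maximum matching (baseline, not the patented method).

    Tries the longest window first and shrinks it by one character on each
    failed lookup. Returns the words and the number of dictionary lookups.
    """
    words = []
    probes = 0  # dictionary lookups performed (hypothetical instrumentation)
    i = 0
    while i < len(text):
        size = min(max_len, len(text) - i)
        while size > 1:
            probes += 1
            if text[i:i + size] in dictionary:
                break
            size -= 1
        # size == 1 means no multi-character match; emit a single character.
        words.append(text[i:i + size])
        i += size
    return words, probes


# Toy data, assumed for illustration only.
toy_dict = {"今天", "天气", "特别"}
long_window = forward_max_match("今天天气特别好", toy_dict, 7)
short_window = forward_max_match("今天天气特别好", toy_dict, 2)
too_short = forward_max_match("人工智能", {"人工", "智能", "人工智能"}, 2)
```

On this toy input a window of 7 needs 12 lookups and a window of 2 needs 3 for the same segmentation, while the window of 2 splits the four-character dictionary word in `too_short` into two pieces. These are the two failure modes described above.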

Examples

Embodiment 1

[0022] Embodiment 1: as shown in Figures 1-3, a dictionary-based forward maximum matching Chinese word segmentation method that successively adds one character; the steps of the method are:

[0023] Step 1: rough segmentation. Remove punctuation marks, spaces, dates, numbers, English letters and other marks from the text to be segmented. Let the text to be processed be A; it is divided into a set of N short text sequences S_i (0 < i ≤ N), A = {S_1, S_2, S_3, ..., S_N};
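
The rough segmentation step can be sketched in Python as follows. This is a simplified illustration under a stated assumption: every character outside the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) is treated as a separator, which covers punctuation, spaces, digits and Latin letters but does not recognize full date expressions. The names `rough_segment` and `_SEPARATORS` are hypothetical.

```python
import re

# Assumption: any run of characters that are not basic CJK ideographs
# (punctuation, whitespace, digits, Latin letters, ...) acts as a separator.
_SEPARATORS = re.compile(r"[^\u4e00-\u9fff]+")


def rough_segment(text):
    """Split the text A into the short texts S_1 ... S_N, dropping separators."""
    return [piece for piece in _SEPARATORS.split(text) if piece]


short_texts = rough_segment("今天天气特别好，我们去公园 abc 123。明天见！")
```

Each element of `short_texts` is then handed to the dictionary matching stage on its own, so no candidate word ever spans a separator.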

[0024] Step 2: as shown in Figure 2, the short texts after rough segmentation are read in one by one in sequence, denoted S_i. Let each sentence sequence S_i consist of m characters W_ij (0 < j ≤ m), S_i = <W_i1 W_i2 W_i3 ... W_im>

[0025] Step 3: segment the roughly segmented text S_i into words, as shown in Figure 2.

[0026] 1) Set a word segmentation search length L slightly smaller than the maximum word length in the dictionary; L is generally slightly smaller than the maximum word ...

Embodiment 2

[0034] Embodiment 2: as shown in Figures 1-3, a dictionary-based forward maximum matching Chinese word segmentation method that successively adds one character; the steps of the method are:

[0035] Set a word segmentation search length L slightly smaller than the maximum word length in the dictionary, and let the character string to be segmented be S = s_1 s_2 s_3 s_4 ... s_i. Starting from the beginning of the sentence, take the first two characters s_1 s_2 and check whether s_1 s_2 is a word in the dictionary. If it is not, s_1 is treated as a single-character word and segmented out; the length pointer of the searched text then advances by one character to the third character, and s_2 s_3 is taken for a new round of dictionary search and matching. If s_1 s_2 is a word in the dictionary, one more character is appended and s_1 s_2 s_3 is checked; if s_1 s_2 s_3 is not a word in the dictionary, it indicat...
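
The visible part of this procedure can be sketched in Python as below. Because the paragraph is truncated, the stopping rule is an assumption made here, not something the source confirms: the candidate grows by one character while it remains a dictionary word, and the last successful candidate is emitted when a lookup fails, when the candidate reaches the search length L (`max_len`), or when the short text ends. The function name and toy dictionary are hypothetical.

```python
def segment_add_one(short_text, dictionary, max_len):
    """Forward matching that grows the candidate one character at a time.

    Assumed stopping rule: stop at the first failed lookup, at max_len
    characters, or at the end of the text, and emit what matched so far.
    """
    words = []
    i = 0
    n = len(short_text)
    while i < n:
        end = i + 1  # a lone character is always accepted as a word
        # Try s_i s_(i+1), then s_i s_(i+1) s_(i+2), and so on.
        while (end < n and end - i < max_len
               and short_text[i:end + 1] in dictionary):
            end += 1
        words.append(short_text[i:end])
        i = end
    return words


# Toy dictionary, assumed for illustration only.
toy_dict = {"今天", "天气", "特别"}
result = segment_add_one("今天天气特别好", toy_dict, 7)
```

Under this assumed rule a dictionary word whose shorter prefixes are not themselves dictionary words would never be reached; the truncated source text may handle that case differently.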

Embodiment 3

[0036] Embodiment 3: as shown in Figures 1-3, a dictionary-based forward maximum matching Chinese word segmentation method that successively adds one character; the steps of the method are:

[0037] Step 1: read the text to be segmented and roughly segment the input text at obvious separators such as punctuation, numbers, Western characters and charts, dividing it into short texts; for example, one resulting short text is "today's weather is particularly good";

[0038] Step 2: take each roughly segmented short text as the object of further segmentation and set the word segmentation search length L = 7, where L is chosen to be less than the maximum word length in the dictionary, which here is 12;

[0039] Step 3: take the first two characters, "today", of a roughly segmented short text and search for a match in the dictionary; the match succeeds because "today" exists in the dictionary, so one character is added to the length pointer of the searched tex...

Abstract

The invention relates to a dictionary-based method for maximum matching of Chinese word segmentations through successive one word adding in the forward direction, and belongs to the technical field of computer Chinese text processing. The method comprises the following steps: a text to be segmented is first read in, and the input text is coarsely segmented at obvious separators such as punctuation, numbers, Western characters and charts into independent short texts; the coarsely segmented short texts are used as the objects of further segmentation, and the further word segmentation search length is set; the coarsely segmented short texts are matched against the dictionary by successively adding one character in the forward direction until word segmentation of all the short texts is finished. The method avoids the difficulty that traditional forward maximum matching has in balancing segmentation speed and accuracy, and improves both speed and accuracy compared with traditional forward and reverse maximum matching word segmentation algorithms.

Description

Technical field

[0001] The invention relates to a dictionary-based forward maximum matching Chinese word segmentation method that successively adds one character, and belongs to the technical field of computer Chinese text processing.

Background

[0002] With the development of science and technology, human society has entered the information age. Letting computers "understand" human natural language and achieving free human-computer interaction has become an attractive vision. In human language, the word is the smallest independent and meaningful language unit. Chinese differs greatly from English, French and other Western languages. Western languages have obvious spaces between words as separators, and a computer can easily understand the meaning of a sentence based on these spaces; in Chinese sentences the words are packed closely together, which makes them much more difficult for a computer to understand. Chinese word segment...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27
Inventor: 彭艺, 苏黎韡, 邵玉斌, 龙华, 宋浩
Owner: KUNMING UNIV OF SCI & TECH