Chinese word segmentation method based on hash table dictionary structure

A Chinese word segmentation technology based on hash tables, applied in special-purpose data processing, instruments, electrical digital data processing, and related fields. It addresses problems such as the organization of language information, and it improves overall efficiency by improving matching efficiency and increasing comparison speed.

Active Publication Date: 2014-03-19
DALIAN UNIV

AI Technical Summary

Problems solved by technology

Understanding-based word segmentation is also called artificial-intelligence-based word segmentation. Because Chinese linguistic knowledge is broad and complex, it is difficult to organize the various kinds of language information into a form that a machine can read directly. Understanding-based word segmentation systems are therefore still at the experimental stage.


Embodiment Construction

[0021] The present invention will be further described below in conjunction with the accompanying drawings.

[0022] As shown in Figure 1, a dictionary structure is established first: the hash table of the present invention is stored in memory in the form of linked lists. An index table is also established to make lookups in later steps easier.
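The dictionary structure described above can be illustrated with a minimal sketch. The source does not specify the hash function or the contents of the index table, so the choices here are assumptions: words are hashed by their first character, a Python list stands in for each linked list, and the index table records the longest word length per bucket. The class name `HashDictionary` and its methods are hypothetical.

```python
class HashDictionary:
    """Toy hash dictionary: words are bucketed by their first character.

    Assumptions (not from the source): the hash key is the first character,
    each bucket is a Python list standing in for a linked list, and the
    index table stores the longest word length found in each bucket.
    """

    def __init__(self, words=()):
        self.buckets = {}   # first character -> list of words in that bucket
        self.index = {}     # first character -> longest word length in bucket
        self.max_len = 0    # longest word length in the whole dictionary
        for word in words:
            self.add(word)

    def add(self, word):
        if not word:
            return
        head = word[0]
        bucket = self.buckets.setdefault(head, [])
        if word not in bucket:
            bucket.append(word)
        self.index[head] = max(self.index.get(head, 0), len(word))
        self.max_len = max(self.max_len, len(word))

    def __contains__(self, word):
        # Only the bucket for the first character is scanned.
        return bool(word) and word in self.buckets.get(word[0], [])

    def longest_from(self, head):
        """Longest dictionary word starting with `head`, or 0 if none."""
        return self.index.get(head, 0)
```

With such an index, a matcher could cap the candidate length at `longest_from(ch)` for the current character instead of always starting from the global maximum, which is one plausible way an index table could speed up matching.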

[0023] In the preprocessing stage, the text to be processed is split into sentences, with the period as the terminator, to reduce the complexity of the subsequent two-way scanning.
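A minimal sketch of this preprocessing step follows. It assumes the terminator is the Chinese full stop "。" and that empty blocks are discarded; the function name `split_blocks` is hypothetical and the source does not say how other punctuation is handled.

```python
def split_blocks(text, terminator="。"):
    """Split text into processing blocks at each terminator.

    Assumption: the terminator is the Chinese full stop; the terminator
    itself is dropped and empty blocks are skipped.
    """
    blocks = []
    for piece in text.split(terminator):
        piece = piece.strip()
        if piece:
            blocks.append(piece)
    return blocks
```

Each returned block can then be scanned independently in both directions, so a disagreement between the two scans only has to be resolved within one sentence.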

[0024] Next, the system performs forward and reverse maximum matching on each text block to be processed. The basic process of the forward maximum matching method is as follows: assuming that the longest word in the word segmentation dictionary has length n, a string s of length n is taken each time from the beginning of the string to be segmented, and...
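Since the paragraph above is cut off, the following sketch uses the textbook formulation of maximum matching, which may differ in detail from the patent: take up to n characters, shorten the candidate by one character until it is found in the dictionary, and fall back to a single character. A plain Python set stands in for the dictionary, and the function names are hypothetical.

```python
def forward_max_match(text, words, max_len):
    """Forward maximum matching: scan left to right, longest candidate first."""
    result = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or cand in words:
                result.append(cand)
                i += size
                break
    return result


def reverse_max_match(text, words, max_len):
    """Reverse maximum matching: scan right to left, longest candidate first."""
    result = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or cand in words:
                result.append(cand)
                j -= size
                break
    result.reverse()  # segments were collected from the end of the text
    return result
```

For the dictionary {研究, 研究生, 生命, 起源} and the text 研究生命起源, the forward scan gives 研究生 / 命 / 起源, while the reverse scan gives 研究 / 生命 / 起源. This kind of disagreement is what the comparison step has to resolve.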


Abstract

The invention discloses a Chinese word segmentation method based on a hash table dictionary structure. The method comprises the following steps: A, preprocessing the document to be processed; B, performing forward maximum matching and reverse maximum matching scans on each processing block; C, comparing the results of the two scans for each processing block: if the two segmentation results are the same, the forward result is output; if they differ, the number of segments S, the number of single-character dictionary words D, the number of non-dictionary words N, and the maximum word length L are calculated for the forward and the reverse result respectively; D, comparing and analyzing the data produced in step C according to the method and outputting the correct result. The benefits of the method are that matching efficiency during segmentation is improved, comparison of the forward and reverse scan results is faster, and the efficiency of the two-way maximum matching algorithm is fundamentally improved.
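Steps C and D can be sketched as follows. The abstract names the four quantities but does not state the decision rule, so the rule here is an assumption made only for illustration: prefer fewer non-dictionary words N, then fewer single-character dictionary words D, then fewer segments S, then a larger maximum word length L, with ties going to the reverse result. Reading D as the count of single-character dictionary words is also an interpretation of the translated text. The function names are hypothetical, and a plain set stands in for the dictionary.

```python
def stats(segments, words):
    """Return (S, D, N, L) for one segmentation result."""
    s = len(segments)                                            # segments
    d = sum(1 for w in segments if len(w) == 1 and w in words)   # single-char dictionary words
    n = sum(1 for w in segments if w not in words)               # non-dictionary words
    l = max((len(w) for w in segments), default=0)               # longest word
    return s, d, n, l


def choose(forward, reverse, words):
    """Pick one of two segmentations.

    Assumed rule (not from the source): smaller N, then smaller D,
    then smaller S, then larger L; ties go to the reverse result.
    """
    if forward == reverse:
        return forward
    fs, fd, fn, fl = stats(forward, words)
    rs, rd, rn, rl = stats(reverse, words)
    f_key = (fn, fd, fs, -fl)
    r_key = (rn, rd, rs, -rl)
    return forward if f_key < r_key else reverse
```

For example, with the dictionary {研究, 研究生, 生命, 起源}, the forward result 研究生 / 命 / 起源 contains one non-dictionary word (命) and the reverse result 研究 / 生命 / 起源 contains none, so this assumed rule selects the reverse result.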

Description

Technical field

[0001] The invention relates to the technical field of Chinese information processing, in particular to a Chinese word segmentation method based on a hash table dictionary structure.

Background

[0002] Chinese word segmentation is the most basic and important problem in Chinese information processing. It is a key step in the automatic annotation of Chinese text, search engines, machine translation, speech recognition, and similar applications, and the quality of segmentation directly affects the accuracy of their results. Chinese segmentation differs from English segmentation: there is no formal delimiter between Chinese words, so a continuous sequence of Chinese characters can only be regrouped into words according to certain Chinese norms. However, the complexity and variability of Chinese sentence composition have always made word segmentation a difficult point in Chinese information processing. The discovery of unregistered words and the resolution of ambigui...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventor: 盖荣丽, 高菲
Owner: DALIAN UNIV