Chinese word segmentation algorithm based on reverse maximum matching

Reverse maximum matching, a Chinese word segmentation technology applied in computing and special data processing applications, addresses problems such as inaccurate identification of unregistered words, low word segmentation accuracy, and low performance, improving speed, relevance, accuracy, and efficiency.

Status: Inactive. Publication date: 2013-03-27
BEIJING JINHER SOFTWARE

AI Technical Summary

Problems solved by technology

[0004] The present invention provides a Chinese word segmentation algorithm based on reverse maximum matching, aiming at problems such as low word segmentation ac

Embodiment Construction

[0022] The present invention will be further described below in conjunction with the accompanying drawings, so that those of ordinary skill in the art can implement it after referring to this specification.

[0023] As shown in Figure 1, the Chinese word segmentation algorithm based on reverse maximum matching of the present invention comprises the following steps:

[0024] Step 1: initialize the word segmentation dictionary database and the stop word dictionary StopWord in memory. The word segmentation dictionary database includes a data structure dictionary, WordDictionary, which stores the data structures of all words, and a data directory dictionary, WordList, which stores all words together with their index positions. The first layer of the data structure dictionary stores single Chinese characters, which serve as its index directory; the index position and the word of all word objects with the single Chinese characte...
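
The three in-memory objects named in Step 1 could be laid out roughly as in the sketch below. The source text is truncated before it specifies the second layer of WordDictionary, so everything beyond the names WordDictionary, WordList, and StopWord is an assumption: the function `build_dictionaries` is hypothetical, index positions are taken to be load order, and words are indexed by their final character because that suits right-to-left matching.

```python
from collections import defaultdict


def build_dictionaries(words, stop_words):
    """Build WordDictionary, WordList and StopWord in memory (assumed layout)."""
    # Drop empty entries and duplicates while keeping load order.
    unique_words = [w for w in dict.fromkeys(words) if w]

    # WordList: every word mapped to its index position (assumed: load order).
    word_list = {word: position for position, word in enumerate(unique_words)}

    # WordDictionary: the first layer is a single Chinese character used as the
    # index directory. Assumption: the index character is the word's LAST
    # character; the source does not say which character is used.
    word_dictionary = defaultdict(dict)
    for word, position in word_list.items():
        word_dictionary[word[-1]][word] = position

    # StopWord: a plain set gives constant-time membership tests.
    return dict(word_dictionary), word_list, set(stop_words)
```

Under these assumptions, a matcher can look up the last character of a candidate string and only compare against the words filed under it, rather than scanning the whole word list.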

Abstract

The invention discloses a Chinese word segmentation algorithm based on reverse maximum matching, comprising the following steps: initializing three objects in memory; inputting the text to be segmented; classifying the characters in the text into types according to their character codes; splitting the text into short sentences and adding non-Chinese characters directly to the segmentation results according to their character codes; splitting each short sentence into character strings according to a string matching and decision-making mechanism; matching these strings against the entries of the word segmentation dictionary using the reverse maximum matching algorithm; storing matched strings in the segmentation result set; and merging consecutive unmatched characters and adding them to the segmentation results, which completes the segmentation. The invention thus provides a fast dictionary-based word segmentation algorithm that greatly improves dictionary loading efficiency and segmentation efficiency while maintaining segmentation accuracy.
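
The steps in the abstract can be illustrated with a minimal sketch of generic reverse maximum matching. It is not the patented implementation: the names `rmm_segment_run` and `segment`, the 4-character window `MAX_WORD_LEN`, the use of the CJK Unified Ideographs range to detect Chinese characters, dropping punctuation at short-sentence boundaries, and filtering stop words from the final result are all assumptions made for illustration. The dictionary here is a plain set of words.

```python
import re

MAX_WORD_LEN = 4  # assumed cap on the length of a dictionary word

# Assumption: "Chinese character" means the CJK Unified Ideographs block.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")
# Runs of Chinese characters, or runs of ASCII letters/digits. Anything else
# (punctuation, whitespace) acts as a short-sentence boundary and is dropped.
TOKEN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def rmm_segment_run(run, dictionary, max_len=MAX_WORD_LEN):
    """Segment one run of Chinese characters by reverse maximum matching."""
    pieces = []     # collected right to left, reversed before returning
    unmatched = ""  # consecutive characters that matched no dictionary word
    end = len(run)
    while end > 0:
        matched = None
        # Try the longest window ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = run[end - size:end]
            if candidate in dictionary:
                matched = candidate
                break
        if matched:
            if unmatched:  # flush the merged unmatched characters first
                pieces.append(unmatched)
                unmatched = ""
            pieces.append(matched)
            end -= len(matched)
        else:
            unmatched = run[end - 1] + unmatched  # merge with its neighbours
            end -= 1
    if unmatched:
        pieces.append(unmatched)
    pieces.reverse()
    return pieces


def segment(text, dictionary, stop_words=frozenset()):
    """Split text by character type, then segment each Chinese run."""
    result = []
    for token in TOKEN.findall(text):
        if CHINESE_RUN.fullmatch(token):
            result.extend(rmm_segment_run(token, dictionary))
        else:
            result.append(token)  # non-Chinese text goes straight to the result
    return [word for word in result if word not in stop_words]
```

With the dictionary `{"研究", "研究生", "生命", "起源"}`, `segment("研究生命的起源。", words, {"的"})` returns `["研究", "生命", "起源"]`. Scanning from the right matches 起源 and 生命 before 研究生 can claim the character 生, which is the usual argument for matching in reverse on Chinese text.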

Description

Technical field

[0001] The invention relates to text analysis technology in the field of artificial intelligence, in particular to a classification technology for data mining, which is applied to functions such as search engines and data mining in Internet products.

Background technique

[0002] Today, as the volume of information soars toward an explosion, the Internet industry, which receives and disseminates the largest amount of information, faces a persistent problem: how to let users quickly and accurately search for and locate the resources they need among the wide variety of information on a website. Chinese word segmentation technology is currently widely used in Internet products. It splits a piece of text into multiple words by splitting the text and matching it against a dictionary, helping computers "understand" the core content of the text. For example, the realization of functions such as search en...

Application Information

IPC(8): G06F17/30
Inventors: 代培, 杨爱民
Owner: BEIJING JINHER SOFTWARE