Dictionary-based Lucene Chinese word segmentation method

A Chinese word segmentation and dictionary technology, applied in special data processing applications, instruments, electrical digital data processing, and related fields. It addresses the problem that the existing Chinese word segmentation modules do not support Lucene well, and it offers strong versatility and improved effectiveness.

Active Publication Date: 2016-03-23
成都天府云数信息技术有限公司

AI Technical Summary

Problems solved by technology

Lucene currently provides only single-character and double-character Chinese word segmentation mechanisms, and neither of these modules supports Chinese analysis and processing in Lucene well.
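To make the limitation concrete, here is a minimal sketch of what single-character and double-character segmentation produce. It illustrates the general idea only and is not Lucene's own code; the function names `unigram_tokens` and `bigram_tokens` are hypothetical.

```python
def unigram_tokens(text):
    """Single-character segmentation: every character becomes its own token."""
    return [ch for ch in text if not ch.isspace()]


def bigram_tokens(text):
    """Double-character segmentation: overlapping pairs of adjacent characters."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        # A lone character cannot form a pair, so it is returned as-is.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(unigram_tokens("中文分词"))  # ['中', '文', '分', '词']
print(bigram_tokens("中文分词"))   # ['中文', '文分', '分词']
```

The pair `文分` is not a word, which shows why neither scheme respects real word boundaries, let alone domain terminology.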

Embodiment

[0024] The dictionary-based Lucene Chinese word segmentation method of the present invention has two main stages: construction of a professional dictionary, and text word segmentation. Figure 1 is a flowchart of a specific embodiment of the method. As shown in Figure 1, the method comprises the following steps:

[0025] S101: Build a professional dictionary:

[0026] The present invention first collects a corpus and constructs a professional dictionary. Figure 2 is a flowchart of building the professional dictionary. As shown in Figure 2, the specific steps for constructing the professional dictionary are:

[0027] S201: Corpus preprocessing:

[0028] First, the collected corpus is preprocessed, that is, the manually collected stop words are...
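The preprocessing step can be sketched roughly as follows. The paragraph above is truncated, so this relies on the abstract's description (remove stop words, then divide the corpus into text fragments). The stop-word list, the rule that stop words and non-Chinese characters act as fragment boundaries, and the name `preprocess` are all assumptions made for illustration.

```python
import re

# Hypothetical stop-word list; the patent states that stop words are collected manually.
STOP_WORDS = {"的", "了", "和", "是"}

# Assumption: any run of characters outside the CJK Unified Ideographs block
# (punctuation, digits, Latin letters, spaces) is a fragment boundary.
NON_CJK = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(corpus, stop_words=STOP_WORDS):
    """Replace stop words with a boundary marker, then split into text fragments."""
    # Longest stop words first, so a multi-character stop word wins over its parts.
    for word in sorted(stop_words, key=len, reverse=True):
        corpus = corpus.replace(word, " ")
    return [fragment for fragment in NON_CJK.split(corpus) if fragment]


print(preprocess("全文检索的核心是倒排索引，分词是第一步。"))
# ['全文检索', '核心', '倒排索引', '分词', '第一步']
```

Plain substring replacement is a simplification: a stop character that also occurs inside a genuine word would split that word, so a real system would need a more careful rule.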

Abstract

The invention discloses a dictionary-based Chinese word segmentation method. The method comprises collecting a corpus, building a terminological dictionary, and segmenting text. To build the terminological dictionary, stop words are removed first and the corpus is divided into text fragments. Candidate words are extracted from the text fragments, and the probability of occurrence of each candidate word and of each individual character across all text fragments is obtained through statistics. The mutual information of the two Chinese characters in each candidate word is then calculated; a candidate word is kept if its mutual information is larger than a preset mutual information threshold and deleted otherwise. The candidate words that remain after screening are combined, the combined candidate words are matched against a general dictionary and filtered, and the candidate words that remain after filtering are added to the terminological dictionary. To segment a text, the terminological dictionary is applied first, and the remaining text is then segmented with the general dictionary. Because the terminological dictionary is built by statistically extracting terminology from the corpus, the method is highly general, and segmenting with the terminological dictionary effectively meets the requirements of a professional field.
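The mutual information screening described in the abstract can be sketched as follows. The exact formula is not given here, so the sketch assumes the common pointwise form MI(x, y) = log2(p(xy) / (p(x) p(y))), with probabilities estimated as relative frequencies over all text fragments. The function name `mutual_information_filter`, the sample fragments, and the threshold value are hypothetical. The later steps (combining candidates and filtering against a general dictionary) are not covered.

```python
import math
from collections import Counter


def mutual_information_filter(fragments, threshold):
    """Keep two-character candidates whose mutual information exceeds threshold."""
    char_counts = Counter()
    pair_counts = Counter()
    for fragment in fragments:
        char_counts.update(fragment)
        # Candidate words: every pair of adjacent characters inside a fragment.
        pair_counts.update(fragment[i:i + 2] for i in range(len(fragment) - 1))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    kept = {}
    for pair, count in pair_counts.items():
        p_xy = count / total_pairs                   # probability of the candidate
        p_x = char_counts[pair[0]] / total_chars     # probability of first character
        p_y = char_counts[pair[1]] / total_chars     # probability of second character
        mi = math.log2(p_xy / (p_x * p_y))
        if mi > threshold:
            kept[pair] = mi
    return kept


fragments = ["倒排索引", "索引文件", "建立索引", "文件系统"]
candidates = mutual_information_filter(fragments, threshold=2.0)
print(sorted(candidates))
```

On a corpus this small, pairs that occur only once can still score highly, which is a known weakness of pointwise mutual information and one reason a realistic corpus and a tuned threshold matter.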

Description

Technical field

[0001] The invention belongs to the technical field of Chinese word segmentation, and more specifically relates to a dictionary-based Lucene Chinese word segmentation method.

Background technique

[0002] There is an obvious difference between Chinese text and English text. English words are separated by spaces, whereas Chinese text has no explicit separator between words: Chinese words mostly consist of two or more Chinese characters, and sentences are written continuously. This means that before Chinese text can be analyzed automatically, a whole sentence must be cut into small lexical units, a process known as Chinese word segmentation. Chinese word segmentation is a difficult point in today's Chinese information processing and retrieval, and it is an unavoidable problem in this research field. Chinese word segmentation has already achieved some results and is widely used in areas such as information retrieval.

[0003] With the rapid ...
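As an illustration of cutting a sentence into lexical units with dictionaries, the sketch below follows the two-pass order stated in the abstract: the terminological dictionary is applied first, then the general dictionary handles the remaining text. The matching strategy is not specified in this excerpt, so forward maximum matching is an assumption, as are the function names, the `max_len` parameter, and the sample dictionaries.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right longest match. Returns (token, matched) pairs."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                out.append((piece, True))
                i += size
                break
        else:
            # No dictionary entry starts here: emit one unmatched character.
            out.append((text[i], False))
            i += 1
    return out


def segment(text, term_dict, general_dict):
    """Pass 1 uses the terminological dictionary, pass 2 the general dictionary."""
    tokens, pending = [], ""

    def flush():
        # Segment the text left over between two terms with the general dictionary.
        nonlocal pending
        if pending:
            tokens.extend(tok for tok, _ in forward_max_match(pending, general_dict))
            pending = ""

    for piece, matched in forward_max_match(text, term_dict):
        if matched:
            flush()
            tokens.append(piece)
        else:
            pending += piece
    flush()
    return tokens


term_dict = {"倒排索引"}                       # hypothetical professional terms
general_dict = {"搜索", "引擎", "搜索引擎", "使用"}  # hypothetical general words
print(segment("搜索引擎使用倒排索引", term_dict, general_dict))
# ['搜索引擎', '使用', '倒排索引']
```

Characters found in neither dictionary fall through as single-character tokens, so the output always covers the whole input.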

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/3335, G06F16/374
Inventor: 孙健, 张祥
Owner: 成都天府云数信息技术有限公司