Iteration-based three-step unsupervised Chinese word segmentation method

An unsupervised Chinese word segmentation technology, applied in character and pattern recognition, special data processing applications, instruments, etc., that addresses the high complexity of existing methods

Active Publication Date: 2018-05-22
北京时空迅致科技有限公司

AI Technical Summary

Problems solved by technology

MCA is composed of five complex sub-models and also requires preprocessing the corpus for word alignment. Its complexity is high, and its word segmentation accuracy leaves considerable room for improvement.



Examples


Embodiment 1

[0085] This embodiment describes a specific implementation of the iteration-based three-step unsupervised Chinese word segmentation method.

[0086] As Figure 1 shows, the iteration-based three-step unsupervised Chinese word segmentation method comprises three processes: initialization, iterative processing, and adjustment processing. The iterative processing itself consists of three steps: local segmentation, global word selection, and corpus subtraction.
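The control flow of paragraph [0086] can be sketched as a loop. This is only an illustrative skeleton, not the patented implementation: the function name `iterative_segmentation`, the three callables, and the `max_iters` cap are hypothetical, the adjustment processing is omitted, and the stopping rule (stop when an iteration produces no new words) is taken from the abstract.

```python
from typing import Callable, List, Set

def iterative_segmentation(
    corpus: List[str],
    local_segment: Callable[[List[str]], List[List[str]]],
    select_words: Callable[[List[List[str]]], Set[str]],
    subtract_corpus: Callable[[List[str], Set[str]], List[str]],
    max_iters: int = 50,  # hypothetical safety cap, not from the source
) -> Set[str]:
    """Run local segmentation, global word selection, and corpus
    subtraction repeatedly until no new word is produced."""
    dictionary: Set[str] = set()  # initialization: start with an empty dictionary
    for _ in range(max_iters):
        # Step 1: locally optimal segmentation of the current corpus.
        candidates = local_segment(corpus)
        # Step 2: keep only globally supported words not yet in the dictionary.
        new_words = select_words(candidates) - dictionary
        if not new_words:
            break  # no new word generated: stop iterating
        dictionary |= new_words
        # Step 3: remove known words and continue on the reduced corpus.
        corpus = subtract_corpus(corpus, dictionary)
        if not corpus:
            break  # nothing left to segment
    return dictionary
```

Passing the three steps in as callables keeps the skeleton independent of any particular scoring model, so a MISC-style scorer or a simpler stand-in can be plugged in without changing the loop.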

[0087] In the unsupervised word segmentation framework of the present invention, the first step of each iteration uses the word formation probability model based on segmentation-context mutual independence (MISC) to perform locally optimal unsupervised segmentation of the text corpus. The MISC model does not need to introduce statistical assumptions about segment length, it takes both global and local features into account, and its form is simple and effective; for the long t...


Abstract

The invention discloses an iteration-based three-step unsupervised Chinese word segmentation method and belongs to the field of natural language processing. The basic idea is an unsupervised word segmentation framework that iteratively executes local segmentation, global word selection, and corpus reduction. In each iteration, a word formation probability model based on segmentation-context mutual independence performs locally optimal unsupervised segmentation of the text corpus; the form of the model is simple and effective. A document-level pulse weighting method is adopted to handle the long-tail phenomenon. New words are screened according to a global support degree, and a dictionary is generated incrementally. Finally, the text is segmented with the dictionary under the longest-matching and maximum-probability principles, the words already segmented are filtered out, consecutive unsegmented spans are stitched together and reconstructed into a smaller training corpus, and the same iterative processing is applied to the remaining corpus until no new words are generated. The method outperforms the best-performing existing unsupervised Chinese word segmentation algorithms.
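The final step of the abstract (dictionary-based segmentation followed by corpus reduction) can be illustrated with a minimal sketch. It assumes plain greedy forward longest matching and leaves out the maximum-probability principle, for which the text gives no details; the names `longest_match` and `subtract_corpus` and the sample dictionary are hypothetical.

```python
from typing import List, Set, Tuple

def longest_match(text: str, dictionary: Set[str]) -> List[Tuple[str, bool]]:
    """Greedy forward longest matching.

    Returns (token, is_dictionary_word) pairs; characters that match
    no dictionary entry are emitted one at a time with a False flag.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    tokens: List[Tuple[str, bool]] = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                tokens.append((piece, True))
                i += size
                break
        else:
            tokens.append((text[i], False))
            i += 1
    return tokens

def subtract_corpus(sentences: List[str], dictionary: Set[str]) -> List[str]:
    """Drop dictionary words and stitch each run of consecutive
    unmatched characters into one fragment of the reduced corpus."""
    reduced: List[str] = []
    for sentence in sentences:
        run: List[str] = []
        for token, known in longest_match(sentence, dictionary):
            if known:
                if run:  # a known word closes the current unmatched run
                    reduced.append("".join(run))
                    run = []
            else:
                run.append(token)
        if run:
            reduced.append("".join(run))
    return reduced

# Hypothetical sample dictionary and sentences, for illustration only.
words = {"自然", "语言", "处理"}
print(subtract_corpus(["自然语言处理很有趣", "我爱自然语言"], words))
# prints ['很有趣', '我爱']
```

Each fragment in the reduced corpus contains only material not yet covered by the dictionary, which is why the training corpus shrinks from one iteration to the next.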

Description

Technical field

[0001] The invention relates to an iteration-based three-step unsupervised Chinese word segmentation method, which belongs to the technical fields of artificial intelligence, machine learning, and natural language processing.

Background technique

[0002] Various natural language processing tasks, including information retrieval, machine translation, and text understanding and mining, are all carried out with words as the basic unit. Chinese text consists of a sequence of consecutive characters, with no boundaries between the words in a sentence. Therefore, natural language processing of Chinese requires word segmentation first, that is, splitting the consecutive character sequence into a word sequence, followed by processing such as grammatical analysis, semantic understanding, and pragmatic analysis.

[0003] Existing word segmentation algorithms can be roughly divided into supervised methods and unsupervised methods. Supervised word segmentation is to pe...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30, G06K9/62
CPC: G06F16/3346, G06F16/374, G06F40/284, G06F18/214
Inventor: 袁武, 袁文
Owner: 北京时空迅致科技有限公司