Chinese lexical analysis method based on a linear model

A Chinese lexical analysis technique, applied in the field of statistical natural language processing, that addresses the limitation of relying on a single word segmentation model, with the stated aims of strong generalization ability and improved accuracy.

Inactive Publication Date: 2008-10-29
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

Existing word segmentation systems based on character-feature classifiers rely on a single segmentation model and cannot easily make direct use of statistical information obtained from the corpus, for example: how likely a given word is to receive a given part-of-speech tag, how likely a part-of-speech tag sequence is to appear, and how likely a word sequence is to appear. As a result, the segmentation and tagging accuracy of these systems leaves room for improvement.



Examples


Embodiment

[0042] The invention is described further below, taking as an example an analysis method that uses a perceptron classifier, a word sequence language model, a part-of-speech tag sequence language model, and a co-occurrence score model over word/part-of-speech pairs.

[0043] Each model in the invention is trained on a corpus: a collection of sentences that have been word-segmented and part-of-speech tagged by hand by human experts. From this corpus, a machine learning model can learn how to segment and tag. The learned knowledge is then applied to new, unlabeled sentences that need to be segmented and tagged.
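As an illustration of the kind of corpus statistics involved, the following minimal sketch estimates, from a toy annotated corpus, how likely a word is to carry a given part-of-speech tag and how likely one tag is to follow another. The corpus, the tag names, and the function names `p_tag_given_word` and `p_tag_given_prev` are assumptions made for this example; the patent text shown here does not specify how its statistics are estimated, and no smoothing is applied.

```python
from collections import Counter

# Toy hand-annotated corpus: each sentence is a list of (word, POS tag) pairs.
# Sentences and tags are invented for illustration only.
corpus = [
    [("我", "PN"), ("爱", "VV"), ("北京", "NR")],
    [("他", "PN"), ("爱", "VV"), ("学习", "NN")],
    [("我", "PN"), ("学习", "VV"), ("中文", "NN")],
]

word_tag = Counter()     # count(word, tag)
word_count = Counter()   # count(word)
tag_bigram = Counter()   # count(previous tag, tag)
tag_history = Counter()  # count(previous tag) used as a bigram history

for sentence in corpus:
    prev = "<s>"  # sentence-start marker
    for w, t in sentence:
        word_tag[(w, t)] += 1
        word_count[w] += 1
        tag_bigram[(prev, t)] += 1
        tag_history[prev] += 1
        prev = t


def p_tag_given_word(tag, w):
    """Relative-frequency estimate of P(tag | word); 0.0 for unseen words."""
    return word_tag[(w, tag)] / word_count[w] if word_count[w] else 0.0


def p_tag_given_prev(tag, prev):
    """Relative-frequency estimate of P(tag | previous tag)."""
    return tag_bigram[(prev, tag)] / tag_history[prev] if tag_history[prev] else 0.0
```

In this toy corpus the word 学习 is tagged once as NN and once as VV, so both estimates of P(tag | 学习) come out equal. A realistic system would need far more data and some form of smoothing.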

[0044] First, the two components of the invention are introduced in turn: the perceptron classifier model, and the upper-level linear model built on linear interpolation (i.e., the linear lexical analysis model).
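A linear model of this kind can be pictured as a weighted sum of the scores produced by its sub-models. The sketch below is only an illustration under stated assumptions: the use of log-probabilities, the feature names, the weight values, and the function `linear_model_score` are all hypothetical, and the patent's own formulation is in the elided paragraphs.

```python
import math


def linear_model_score(features, weights):
    """Weighted sum of sub-model scores (log-probabilities in this sketch)."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical sub-model outputs for one candidate analysis.
candidate = {
    "word_lm": math.log(0.02),        # word sequence language model
    "pos_lm": math.log(0.10),         # POS tag sequence language model
    "word_pos_cooc": math.log(0.50),  # word/POS co-occurrence score model
}

# Hypothetical interpolation weights; in practice these would be tuned.
weights = {"word_lm": 1.0, "pos_lm": 0.5, "word_pos_cooc": 0.8}

score = linear_model_score(candidate, weights)
```

Working in log space turns products of probabilities into sums, which is one common reason to combine sub-models linearly; whether the patent does so is not stated in the text shown here.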

[0045] ...


Abstract

The invention provides a Chinese lexical analysis method based on a linear model, comprising the following steps:
1) A Chinese sentence is input and the length of the analysis window is set.
2) The sentence is analyzed character by character. For each character, the character or the set of characters within its analysis window is fed to a perceptron classifier, which yields the perceptron model's score for assigning the current character a given word segmentation tag and part-of-speech tag. The same character or character set is also fed to the linear lexical analysis model, which yields that model's score for assigning the current character the same word segmentation tag and part-of-speech tag.
3) The perceptron score and the linear lexical analysis model score are combined in a weighted sum to give a comprehensive analysis score. The word segmentation tag and part-of-speech tag with the highest comprehensive score are assigned to the current character. Once every character has received its word segmentation tag and part-of-speech tag, the lexical analysis of the sentence is complete.
According to the abstract, the method can significantly improve the accuracy of segmentation and tagging.
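The per-character decision described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the two scoring functions are lookup-table stand-ins, and the names `window`, `tag_sentence`, `w_perceptron`, and `w_linear`, the padding symbol, and the candidate tag set are assumptions introduced for the example.

```python
from typing import Callable, List, Sequence, Tuple

Tag = Tuple[str, str]  # (word segmentation tag, part-of-speech tag)
Scorer = Callable[[Tuple[str, ...], Tag], float]


def window(chars: List[str], i: int, size: int) -> Tuple[str, ...]:
    """Characters within `size` positions of index i, padded at the edges."""
    return tuple(
        chars[j] if 0 <= j < len(chars) else "<pad>"
        for j in range(i - size, i + size + 1)
    )


def tag_sentence(
    sentence: str,
    candidates: Sequence[Tag],
    perceptron_score: Scorer,
    linear_score: Scorer,
    w_perceptron: float = 0.5,
    w_linear: float = 0.5,
    size: int = 2,
) -> List[Tag]:
    """Pick, for each character, the tag with the highest weighted-sum score."""
    chars = list(sentence)
    result = []
    for i in range(len(chars)):
        ctx = window(chars, i, size)
        best = max(
            candidates,
            key=lambda t: w_perceptron * perceptron_score(ctx, t)
            + w_linear * linear_score(ctx, t),
        )
        result.append(best)
    return result


# Stand-in scorers keyed on the centre character of the window (illustrative).
PERCEPTRON = {("我", ("S", "PN")): 2.0, ("爱", ("S", "VV")): 1.0, ("爱", ("S", "NN")): 1.5}
LINEAR = {("我", ("S", "PN")): 1.0, ("爱", ("S", "VV")): 2.0}


def toy_perceptron(ctx, tag):
    return PERCEPTRON.get((ctx[len(ctx) // 2], tag), 0.0)


def toy_linear(ctx, tag):
    return LINEAR.get((ctx[len(ctx) // 2], tag), 0.0)


CANDIDATES = [("S", "PN"), ("S", "VV"), ("S", "NN")]
tags = tag_sentence("我爱", CANDIDATES, toy_perceptron, toy_linear)
```

In this toy setup the perceptron stand-in alone prefers NN for 爱, but the combined score favors VV, which shows how the second model can change the final decision.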

Description

Technical field

[0001] The invention relates to the technical field of statistical natural language processing, in particular to statistical Chinese word segmentation and part-of-speech tagging.

Background

[0002] Chinese lexical analysis has two goals: word segmentation and part-of-speech tagging. Word segmentation divides a Chinese sentence, whose characters run together without delimiters, into words, converting a sequence of Chinese characters into a sequence of Chinese words. Part-of-speech tagging builds on word segmentation and assigns each Chinese word a part-of-speech tag, such as VV (verb) or NN (noun). Given a Chinese sentence, how should word segmentation and part-of-speech tagging be performed? There are two strategies: one performs word segmentation first and then tags parts of speech on the segmented result; the other takes part-of-speech tagging into account during the word segmentation process. Obv...
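To make the joint strategy concrete, the sketch below shows one common way to turn a segmented, tagged sentence into per-character labels that carry both a word-boundary tag and a part-of-speech tag, and to recover the words again. The B/M/E/S boundary scheme, the `B-NR` label format, and the function names are assumptions for illustration; the text shown here does not state which tag set the patent uses.

```python
def to_char_tags(words):
    """Convert [(word, pos), ...] into [(char, 'B-NR'-style label), ...].

    Boundary tags: S = single-character word, B = begin, M = middle, E = end.
    """
    tagged = []
    for word, pos in words:
        if len(word) == 1:
            bounds = ["S"]
        else:
            bounds = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        tagged.extend((ch, f"{b}-{pos}") for ch, b in zip(word, bounds))
    return tagged


def from_char_tags(tagged):
    """Rebuild [(word, pos), ...] from per-character labels."""
    words, buf = [], ""
    for ch, label in tagged:
        bound, pos = label.split("-", 1)
        buf += ch
        if bound in ("S", "E"):  # a word ends at this character
            words.append((buf, pos))
            buf = ""
    return words


sentence = [("我", "PN"), ("爱", "VV"), ("北京", "NR")]
char_tags = to_char_tags(sentence)
```

With labels of this shape, segmentation and tagging reduce to a single per-character classification problem, which is what allows both tasks to be decided together.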


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventor: 姜文斌, 黄亮, 刘群, 吕雅娟
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI