A Chinese word segmentation method based on naive Bayesian algorithm

A Bayesian algorithm and Chinese word segmentation technology, which is applied in computing, computer components, special data processing applications, etc., and can solve problems such as inconsistency

Status: Inactive; Publication Date: 2019-03-01
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

The naive Bayes classifier (NBC) model assumes that attributes are independent of one another. This assumption often does not hold in practical applications, which reduces the classification accuracy of the NBC model.
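
For reference, the independence assumption described above is usually written as the factorization below. The symbols c (a class) and x_1, ..., x_n (the attributes) are notation introduced here for illustration and do not come from the patent text.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The product form is valid only when the attributes x_i are conditionally independent given c. When attributes are correlated, the product over- or under-counts evidence, which is the source of the accuracy loss noted above.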



Examples


Embodiment 1

[0035] Embodiment 1: As shown in Figure 1, a Chinese word segmentation method based on the naive Bayesian algorithm proceeds as follows. First, a suitable document is selected as the corpus and split into sentences. The corpus is then annotated: each character is marked with its state and also with its part of speech. Next, the annotated corpus is counted to obtain a state transition matrix, which provides the basis for the later prediction stage. Features are then extracted for each character from the annotated corpus; to improve accuracy, the features of each character include the attributes of the preceding and following characters. A model is trained on the feature file of each Chinese character. The state transition matrix and the probability model are then used to predict the state of each Chinese character in the sentence to be segmented. Finally, the sentence is segmented according to the predicted states of its characters.
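
The pipeline in Embodiment 1 can be sketched roughly as follows. This is a minimal illustration under several assumptions that the text does not confirm: the character states are taken to be the common B/M/E/S tags (begin, middle, end, single), the naive Bayes likelihoods use add-one smoothing, the transition matrix and the likelihoods are combined with Viterbi decoding, and sentence start and end are modeled as pseudo-states. Part-of-speech tags are omitted for brevity. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

STATES = "BMES"          # assumed tag set: Begin, Middle, End, Single
START, END = "^", "$"    # assumed pseudo-states for sentence boundaries

def tag_words(words):
    """Turn a pre-segmented sentence into (character, state) pairs."""
    pairs = []
    for w in words:
        if len(w) == 1:
            pairs.append((w, "S"))
        else:
            pairs.append((w[0], "B"))
            pairs.extend((c, "M") for c in w[1:-1])
            pairs.append((w[-1], "E"))
    return pairs

def features(chars, i):
    """Features of character i: itself plus its preceding and following character."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [("cur", chars[i]), ("prev", prev_c), ("next", next_c)]

def train(sentences):
    """Count state transitions and per-state feature frequencies."""
    trans = defaultdict(Counter)        # trans[a][b]: state a followed by state b
    feat_count = defaultdict(Counter)   # feat_count[state][feature]
    vocab = set()
    for words in sentences:
        pairs = tag_words(words)
        chars = [c for c, _ in pairs]
        tags = [t for _, t in pairs]
        trans[START][tags[0]] += 1
        trans[tags[-1]][END] += 1
        for i, t in enumerate(tags):
            for f in features(chars, i):
                feat_count[t][f] += 1
                vocab.add(f)
            if i > 0:
                trans[tags[i - 1]][t] += 1
    return trans, feat_count, vocab

def log_prob(count, total):
    """Log of a relative frequency; unseen events get probability zero."""
    return math.log(count / total) if count > 0 else float("-inf")

def segment(sentence, model):
    """Predict a state per character with Viterbi, then cut after E and S."""
    trans, feat_count, vocab = model
    chars = list(sentence)
    if not chars:
        return []

    def t(a, b):
        return log_prob(trans[a][b], sum(trans[a].values()))

    def emit(i, s):
        # Naive Bayes likelihood of the features given state s (add-one smoothing).
        total = sum(feat_count[s].values()) + len(vocab)
        return sum(math.log((feat_count[s][f] + 1) / total)
                   for f in features(chars, i))

    score = {s: t(START, s) + emit(0, s) for s in STATES}
    back = []
    for i in range(1, len(chars)):
        new_score, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + t(p, s))
            new_score[s] = score[best] + t(best, s) + emit(i, s)
            pointers[s] = best
        score = new_score
        back.append(pointers)

    last = max(STATES, key=lambda s: score[s] + t(s, END))
    tags = [last]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    tags.reverse()

    words, current = [], ""
    for c, tag in zip(chars, tags):
        current += c
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

# Toy corpus, already segmented into words (hypothetical data).
corpus = [["我", "喜欢", "北京"], ["我", "爱", "北京"], ["他", "喜欢", "上海"]]
model = train(corpus)
print(segment("他爱上海", model))
```

Transitions that never occur in the training corpus receive probability zero here, so the decoder cannot produce an impossible tag sequence such as B followed by B. A real corpus would be far larger than this toy example, and a production system would likely need smoothing or back-off for unseen transitions.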

[0036] The specific steps are:

[0037] (1) Find a corpus suitable as a training set, an...


Abstract

The invention relates to a Chinese word segmentation method based on a naive Bayesian algorithm and belongs to the field of natural language processing. The method first selects suitable documents as a corpus and splits the corpus into sentences. The corpus is then tagged: each character is tagged with its state and also with its part of speech. The tagged corpus is counted to obtain a state transition matrix, which provides the basis for the later prediction phase. Next, the features of each character are extracted from the tagged corpus; to improve accuracy, the features of each character include the attributes of its neighboring characters. A model is then trained using the feature file of each Chinese character. Each Chinese character in the sentence to be segmented is then predicted using the state transition matrix and the probability model. Finally, the sentence is segmented according to the predicted states of its characters.

Description

Technical field

[0001] The invention relates to a Chinese word segmentation method based on a naive Bayesian algorithm, which belongs to the field of natural language processing.

Background art

[0002] Chinese word segmentation refers to dividing a sequence of Chinese characters into individual words. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications. In English writing, spaces serve as natural delimiters between words. In Chinese, only characters, sentences, and paragraphs are separated by obvious delimiters; words have no formal delimiter. English also has the problem of dividing phrases, but at the word level Chinese is much more complicated and difficult than English. For Chinese word segmentation, the most important thing for search engines is not to find all the results, because it does not make much sense to find all the results in...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06K9/62
CPC: G06F40/211, G06F40/242, G06F40/289, G06F18/24155
Inventor: 邵玉斌, 郭海震, 龙华, 杜庆治
Owner KUNMING UNIV OF SCI & TECH