A Chinese word segmentation method based on naive Bayesian algorithm

A Bayesian algorithm and Chinese word segmentation technology, which is applied in computing, computer components, special data processing applications, etc., and can solve problems such as inconsistency

Status: Inactive | Publication Date: 2019-03-01

KUNMING UNIV OF SCI & TECH

8 Cites, 0 Cited by


Problems solved by technology

In practice, the naive Bayes classifier (NBC) does not always perform as theory suggests. The NBC model assumes that attributes are independent of each other. This assumption often fails in practical applications, which degrades the classification accuracy of the NBC model.
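For reference, the independence assumption mentioned above is usually written as follows. This is the standard textbook form of naive Bayes rather than a formula quoted from the patent; the symbols $c$ (a class, here a character state) and $x_1, \dots, x_n$ (the attributes of one character) are notation assumed for illustration.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The product on the right is exact only when the attributes are conditionally independent given the class. When attributes are correlated, as neighboring characters typically are, the product is only an approximation of the true joint likelihood.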


Examples


Embodiment 1

[0035] Embodiment 1: As shown in Figure 1, a Chinese word segmentation method based on the naive Bayesian algorithm proceeds as follows. First, a suitable document is selected as the corpus, and the corpus is split into sentences. The corpus is then annotated, marking both the state and the part of speech of each character. Next, the annotated corpus is counted to obtain a state transition matrix, which provides the basis for the later prediction stage. The features of each character are then extracted from the annotated corpus; to improve accuracy, the features of each character include the attributes of the preceding and following characters. A model is trained on the feature file of each Chinese character. The state transition matrix and the probability model are then used to predict the state of each Chinese character in the sentence to be segmented. Finally, the sentence is segmented according to the predicted states of its characters.
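The training side of this embodiment (annotation, transition counting, feature extraction) might be sketched as below. This is a minimal illustration under stated assumptions, not the patent's implementation: the B/M/E/S state set, the function names, the boundary markers `<s>` and `</s>`, and the two-sentence toy corpus are all hypothetical, and part-of-speech tagging is omitted for brevity.

```python
from collections import Counter, defaultdict

STATES = "BMES"  # assumed tag set: Begin, Middle, End, Single-character word

def label_words(words):
    """Turn a pre-segmented sentence into (character, state) pairs."""
    pairs = []
    for w in words:
        if len(w) == 1:
            pairs.append((w, "S"))
        else:
            tags = "B" + "M" * (len(w) - 2) + "E"
            pairs.extend(zip(w, tags))
    return pairs

def transition_matrix(tagged_sentences):
    """Count state-to-state transitions and normalise each row to probabilities."""
    counts = {a: Counter() for a in STATES}
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        for a, b in zip(tags, tags[1:]):
            counts[a][b] += 1
    matrix = {}
    for a in STATES:
        total = sum(counts[a].values())
        matrix[a] = {b: (counts[a][b] / total if total else 0.0) for b in STATES}
    return matrix

def char_features(chars, i):
    """Features for character i: itself plus the preceding and following characters."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return {"cur": chars[i], "prev": prev_c, "next": next_c}

def train_counts(tagged_sentences):
    """Collect the counts a naive Bayes model needs: state priors and per-state feature counts."""
    prior = Counter()
    feat = defaultdict(Counter)  # feat[state][(feature_name, value)] -> count
    for sent in tagged_sentences:
        chars = [c for c, _ in sent]
        for i, (_, tag) in enumerate(sent):
            prior[tag] += 1
            for name, value in char_features(chars, i).items():
                feat[tag][(name, value)] += 1
    return prior, feat

# Hypothetical toy corpus, already segmented into words.
corpus = [["我", "喜欢", "自然", "语言"], ["自然", "语言", "处理"]]
tagged = [label_words(s) for s in corpus]
trans = transition_matrix(tagged)
prior, feat = train_counts(tagged)
```

The counts in `prior` and `feat` are the raw material for a naive Bayes model; a real system would add smoothing so that unseen characters do not receive zero probability.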

[0036] The specific steps are:

[0037] (1) Find a corpus suitable as a training set, an...


Abstract

The invention relates to a Chinese word segmentation method based on a naive Bayesian algorithm, belonging to the field of natural language processing. The invention first selects suitable documents as a corpus and splits the corpus into sentences. The corpus is then tagged, marking both the state and the part of speech of each character. The tagged corpus is counted to obtain a state transition matrix, which provides a basis for the later prediction phase. The features of each character are then extracted from the tagged corpus; to improve accuracy, the features of each character include the attributes of the neighboring characters. A model is then trained using the feature files of each Chinese character. Each Chinese character in the sentence to be segmented is then predicted using the state transition matrix and the probability model. Finally, the sentence is segmented according to the predicted states of its characters.
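The prediction and segmentation steps could look roughly like the sketch below. The source does not say how the state transition matrix and the probability model are combined, so Viterbi decoding is used here purely as an assumption. The B/M/E/S state set, every function name, and all probabilities are hypothetical toy values, not figures from the patent.

```python
import math

STATES = "BMES"  # assumed tag set: Begin, Middle, End, Single-character word
NEG_INF = float("-inf")

def log_probs(probs):
    """Convert {state: probability} to log space; missing or zero entries become -inf."""
    return {s: math.log(probs[s]) if probs.get(s, 0) > 0 else NEG_INF for s in STATES}

def viterbi(scores, trans, start):
    """Find the best state path given per-character log scores and log transitions."""
    best = [{s: start[s] + scores[0][s] for s in STATES}]
    back = []
    for sc in scores[1:]:
        cur, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + trans[p][s])
            cur[s] = best[-1][prev] + trans[prev][s] + sc[s]
            ptr[s] = prev
        best.append(cur)
        back.append(ptr)
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back):  # follow back-pointers from the last character
        state = ptr[state]
        path.append(state)
    return path[::-1]

def states_to_words(chars, states):
    """Cut the character sequence after every E (end) or S (single) state."""
    words, current = [], ""
    for c, s in zip(chars, states):
        current += c
        if s in "ES":
            words.append(current)
            current = ""
    if current:  # keep any trailing characters that never reached E or S
        words.append(current)
    return words

# Hypothetical toy parameters; a trained model would supply these.
start = log_probs({"B": 0.6, "S": 0.4})
trans = {
    "B": log_probs({"M": 0.2, "E": 0.8}),
    "M": log_probs({"M": 0.3, "E": 0.7}),
    "E": log_probs({"B": 0.6, "S": 0.4}),
    "S": log_probs({"B": 0.6, "S": 0.4}),
}
sentence = "我喜欢"
scores = [  # per-character state scores, standing in for the probability model
    log_probs({"B": 0.2, "M": 0.1, "E": 0.1, "S": 0.6}),
    log_probs({"B": 0.7, "M": 0.1, "E": 0.1, "S": 0.1}),
    log_probs({"B": 0.1, "M": 0.1, "E": 0.7, "S": 0.1}),
]
states = viterbi(scores, trans, start)
words = states_to_words(sentence, states)
print(states, words)
```

Working in log space avoids numeric underflow on long sentences, and setting disallowed transitions (such as B followed by B) to negative infinity keeps the decoder from producing an invalid state sequence.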

Description

Technical field

[0001] The invention relates to a Chinese word segmentation method based on a naive Bayesian algorithm, which belongs to the field of natural language processing.

Background

[0002] Chinese word segmentation refers to dividing a sequence of Chinese characters into individual words. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications. In English writing, spaces serve as natural delimiters between words. In Chinese, only characters, sentences, and paragraphs are separated by obvious delimiters; words have no formal delimiter. Although English also faces the problem of dividing phrases, at the word level Chinese is much more complicated and difficult than English. For Chinese word segmentation, the most important thing for search engines is not to find all the results, because it does not make much sense to find all the results in...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/27, G06K9/62

CPC: G06F40/211, G06F40/242, G06F40/289, G06F18/24155

Inventor: 邵玉斌, 郭海震, 龙华, 杜庆治

Owner: KUNMING UNIV OF SCI & TECH
