Vietnamese multi-category word disambiguation method based on combination method

A combination-method technique for multi-category words, applied in semantic analysis, natural language translation, natural language data processing, etc. It addresses the difficulty of disambiguating multi-category words, poor generalization performance, and the low accuracy of Vietnamese part-of-speech tagging, to achieve the effect of identifying...

Active Publication Date: 2016-12-07
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0003] The present invention provides a combination-based method for disambiguating Vietnamese multi-category words, to solve the problems of disambiguation of multi-category V...



Examples


Embodiment 1

[0068] Embodiment 1: As shown in Figures 1-5, the specific steps of the combination-based method for disambiguating Vietnamese multi-category words are as follows:

[0069] Step1. First, combine the sentence-level part-of-speech-tagged Vietnamese corpus with the Vietnamese multi-category word dictionary to extract the Vietnamese multi-category word field library; then, based on the characteristics of the Vietnamese language and of multi-category words, obtain the Vietnamese disambiguation features;

[0070] Step2. Apply the maximum entropy statistical analysis method to the multi-category word field corpus formed in the Vietnamese multi-category word field library, and obtain the maximum entropy Vietnamese multi-category word disambiguation model;

[0071] Step3. Use the conditional random field...
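The maximum entropy modeling of Step2 can be illustrated with a minimal sketch, assuming each multi-category word field has already been turned into a list of string features. The feature names (`prev_tag=...`, `next_tag=...`), the tags, the toy samples, and all function names below are illustrative assumptions and do not come from the patent. Training uses plain gradient ascent on the conditional log-likelihood, which is one common way to fit such a model and is not necessarily the procedure the inventors used.

```python
import math
from collections import defaultdict

def predict_proba(weights, tags, feats):
    """Return P(tag | features) under a conditional maximum entropy model."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def train_maxent(samples, epochs=200, lr=0.5):
    """samples: list of (feature_list, gold_tag). Returns (weights, tags)."""
    tags = sorted({tag for _, tag in samples})
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in samples:
            probs = predict_proba(weights, tags, feats)
            for t in tags:
                # observed count minus expected count for each (feature, tag)
                delta = (1.0 if t == gold else 0.0) - probs[t]
                for f in feats:
                    grad[(f, t)] += delta
        for key, g in grad.items():
            weights[key] += lr * g / len(samples)
    return weights, tags

def predict(weights, tags, feats):
    """Pick the most probable tag for one multi-category word occurrence."""
    probs = predict_proba(weights, tags, feats)
    return max(tags, key=lambda t: probs[t])

# Invented toy samples (hypothetical features and tags, for illustration only).
SAMPLES = [
    (["prev_tag=V", "next_tag=E"], "N"),
    (["prev_tag=P", "next_tag=N"], "V"),
    (["prev_tag=V", "next_tag=C"], "N"),
    (["prev_tag=P", "next_tag=A"], "V"),
]
weights, tags = train_maxent(SAMPLES)
```

Calling `predict(weights, tags, features)` on the feature list of a new occurrence then returns the tag with the highest conditional probability.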

Embodiment 2

[0075] Embodiment 2: As shown in Figures 1-5, this embodiment of the combination-based method for disambiguating Vietnamese multi-category words is the same as Embodiment 1, wherein:

[0076] As a preferred solution of the present invention, the specific steps of Step1 are:

[0077] Step1.1. First, use a web crawler program to crawl a Vietnamese web page corpus from the Internet;

[0078] Step1.2. After filtering and denoising the crawled Vietnamese web page corpus, construct a Vietnamese text-level corpus and store it in the database;

[0079] The present invention takes into account that the crawled Vietnamese web page corpus contains noise such as duplicate web pages and web page tags, and that this noise is useless. Therefore, it is necessary to obtain a high-quality text-level corpus containing only Vietnamese through operations such as filtering and denoising, and to store it in the database to facilitate data managemen...
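The filtering and denoising of Step1.2 can be sketched as follows, assuming the crawled pages are available as HTML strings. The function names and the sample pages are hypothetical, and the sketch covers only the two kinds of noise named in the text (web page tags and duplicate pages); it does not check that the remaining text is Vietnamese, and it prints nothing to a database.

```python
import hashlib
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text and ignores script and style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_markup(html):
    """Remove tags and collapse whitespace runs into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def clean_corpus(pages):
    """Return de-duplicated plain texts, keeping first occurrences in order."""
    seen, cleaned = set(), []
    for html in pages:
        text = strip_markup(html)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            cleaned.append(text)
    return cleaned

# Hypothetical crawled pages; the second duplicates the first after cleaning.
PAGES = [
    "<html><body><p>Xin chào</p><script>var a = 1;</script></body></html>",
    "<div>Xin   chào</div>",
    "<p>Việt Nam</p>",
]
CLEANED = clean_corpus(PAGES)
```

Hashing the cleaned text rather than the raw HTML means two pages that differ only in markup are still treated as duplicates, which is a design choice of this sketch.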

Embodiment 3

[0088] Embodiment 3: As shown in Figures 1-5, this embodiment of the combination-based method for disambiguating Vietnamese multi-category words is the same as Embodiment 2, wherein:

[0089] As a preferred solution of the present invention, the specific steps of Step1.5 are:

[0090] Step1.5.1. Retrieve the sentence-level part-of-speech-tagged Vietnamese corpus from the database of Step1.4;

[0091] Step1.5.2. Collect Vietnamese dictionary entries from websites and dictionaries to form a Vietnamese dictionary;

[0092] Step1.5.3. From the Vietnamese dictionary obtained in Step1.5.2, manually screen and extract entries to obtain the Vietnamese multi-category word dictionary;

[0093] Step1.5.4. Using a manually written multi-category word extraction program, combined with the Vietnamese multi-category word dictionary from Step1.5.3, the sentence-level part-of-speech-tagged Vietnamese corpus obtained in Step1....
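An extraction program of the kind mentioned in Step1.5.4 might look like the sketch below. The text does not define what a multi-category word field contains, so this sketch assumes it is the target word, its tag, and a fixed window of tagged neighbors; the window size, the data layout, the function name, and the toy sentence are all assumptions for illustration.

```python
def extract_fields(tagged_sentences, multi_cat_dict, window=2):
    """Collect a context window around every dictionary word in the corpus.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    multi_cat_dict:   set of lower-cased multi-category words.
    window:           number of neighbors kept on each side (assumed value).
    """
    fields = []
    for sent in tagged_sentences:
        for i, (word, tag) in enumerate(sent):
            if word.lower() not in multi_cat_dict:
                continue
            fields.append({
                "word": word,
                "tag": tag,
                "left": sent[max(0, i - window):i],
                "right": sent[i + 1:i + 1 + window],
            })
    return fields

# Invented toy corpus and dictionary (hypothetical tags, for illustration only).
CORPUS = [[("tôi", "P"), ("đá", "V"), ("bóng", "N")]]
DICTIONARY = {"đá"}
FIELDS = extract_fields(CORPUS, DICTIONARY)
```

Each returned record keeps the gold tag of the target word, so the resulting list can serve as training material for the statistical models built in the later steps.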


Abstract

The invention relates to a Vietnamese multi-category word disambiguation method based on a combination method, and belongs to the technical field of natural language processing. The method comprises the following steps: first, extracting Vietnamese multi-category word fields from Vietnamese text and constructing a multi-category word field library; second, performing maximum entropy, conditional random field, and support vector machine disambiguation modeling on the multi-category word field library; then disambiguating the multi-category word field test corpus with the three constructed statistical analysis models, and obtaining the part-of-speech tags of the multi-category words from the analysis results. The method effectively disambiguates Vietnamese multi-category words and provides strong support for subsequent work such as Vietnamese part-of-speech tagging, lexical analysis, syntactic analysis, semantic analysis, information extraction, information retrieval, and machine translation; it also solves the problem of poor generalization performance caused by relying on a single learner.
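The abstract does not state how the outputs of the three models are merged into one tag. One plausible reading is a majority vote, sketched below under that assumption; the function name and the tie-breaking rule (prefer the model listed first) are hypothetical choices, not taken from the patent.

```python
from collections import Counter

def combine_votes(predictions):
    """Majority vote over per-model tags for one multi-category word.

    predictions: tags in a fixed model order, for example the outputs of the
    maximum entropy, conditional random field, and support vector machine
    models. Ties are resolved in favor of the earliest model in that order.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for tag in predictions:
        if counts[tag] == best:
            return tag

# Two of three hypothetical models agree on "N".
AGREED = combine_votes(["N", "V", "N"])
# All three disagree, so the first model's tag is kept.
FALLBACK = combine_votes(["V", "N", "A"])
```

Putting the most trusted model first in the list makes the tie-break explicit, which matters when all three learners disagree.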

Description

Technical field

[0001] The invention relates to a combination-based method for disambiguating Vietnamese multi-category words, and belongs to the technical field of natural language processing.

Background technique

[0002] In the field of Vietnamese natural language processing, the construction of a high-quality Vietnamese corpus is the foundation, premise, and pillar of follow-up work, and such a corpus can be widely used in many areas, such as entity recognition, noun phrase analysis, syntactic analysis, semantic analysis, and higher-level machine translation. Multi-category words are the focus and the main difficulty of Vietnamese part-of-speech tagging: they directly affect tagging accuracy, and handling them well greatly advances the construction of a high-quality Vietnamese part-of-speech-tagged corpus. To ensure the quality and performance of follow-up work, it is necessary to build a high-quality part-of-speech-tagged corpus. Therefor...


Application Information

IPC(8): G06F17/27, G06F17/28
CPC: G06F40/205, G06F40/279, G06F40/30, G06F40/42
Inventor: 郭剑毅, 刘艳超, 余正涛, 线岩团, 严馨, 文永华
Owner KUNMING UNIV OF SCI & TECH