Automatic acquisition method of Chinese reduplication words

A technology for the automatic acquisition of reduplicated words, applied in special data processing applications, instruments, electrical digital data processing, etc.

Inactive Publication Date: 2015-02-25
JIANGSU UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0010] Technical problem 1: Reduplication patterns in word-segmented Chinese text and their statistics over a large-scale corpus
[0011] Technical problem 2: Quantification of reduplicated words
[0012] Technical problem 3: Acquisition and verification of "AA"-type reduplicated words


Embodiment Construction

[0075] The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0076] According to the definition of Chinese reduplicated words, they are classified into the following patterns: "AA", "AAB", "ABB", "ABA", "AABB", "ABAB", "AABC", "BCAA" and "ABAC". The "ABAC", "BCAA", and "AABC" patterns are generally fixed expressions, and most of them are included in Chinese idiom dictionaries. The present invention is aimed at the automatic acquisition of the remaining six types: "AA", "AAB", "ABB", "ABA", "ABAB" and "AABB".
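The six target patterns can be illustrated with a small matcher. This is a minimal sketch, not the patented procedure: it assumes each unit (A, B) is a single Chinese character, whereas the method itself works on word-segmented corpus statistics. The function name `reduplication_pattern` is hypothetical.

```python
from typing import Optional


def reduplication_pattern(word: str) -> Optional[str]:
    """Return the pattern label of `word`, or None if it matches none of the six.

    Assumption: each unit (A, B) is one character. Strings made of a single
    repeated character (e.g. "AAA") are ambiguous and are rejected.
    """
    n = len(word)
    if n == 2:
        a, b = word
        return "AA" if a == b else None
    if n == 3:
        a, b, c = word
        if a == b == c:
            return None
        if a == b:
            return "AAB"
        if b == c:
            return "ABB"
        if a == c:
            return "ABA"
        return None
    if n == 4:
        a, b, c, d = word
        if a == b == c == d:
            return None
        if a == b and c == d:
            return "AABB"
        if a == c and b == d:
            return "ABAB"
        return None
    return None


if __name__ == "__main__":
    for w in ["看看", "毛毛雨", "绿油油", "试一试", "高高兴兴", "研究研究", "中国"]:
        print(w, reduplication_pattern(w))
```

A purely surface match like this only yields candidates; a string such as "AA" can arise by coincidence of segmentation, which is why the method adds statistical judgments on top of pattern matching.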

[0077] As shown in Figure 1, the automatic acquisition method for Chinese reduplicated words provided by this embodiment comprises the following steps:

[0078] 1. Counting over the word-segmented corpus using the quintuple model, including:

[0079] 1.1 Quintuple model statistics:

[0080] The automatic acquisition of reduplicated words requires statistics on the reduplication pattern...

Abstract

The invention discloses a method for automatically acquiring Chinese reduplicated words. A quintuple model is used to collect statistics over a word-segmented corpus, yielding candidate sets for each kind of reduplicated word. On this basis, reduplicated words of the AAB, ABB, ABA, ABAB and AABB types are acquired automatically by calculating and judging the reduplication degree. For AA-type reduplicated words, the judgment of the reduplication degree is followed by the calculation and judgment of the left adjacent entropy and the right adjacent entropy. The method thus achieves quantified judgment and automatic acquisition of reduplicated words from the statistical information obtained by the quintuple model together with the reduplication degree and information entropy. Experiments show that the method is highly accurate and supports more accurate computational processing of natural language; it has clear practical significance in the field of natural language processing and can be widely applied.
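The left and right adjacent entropy mentioned for the AA type can be sketched as follows. This is an illustration under stated assumptions, not the patent's exact formula: it uses the standard Shannon entropy (base 2) over the tokens found immediately to the left and right of each occurrence of a candidate, and it treats sentence boundaries as the placeholder tokens `<s>` and `</s>`. The function names and the toy corpus are hypothetical, and no acceptance threshold is implied.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def entropy(counts: Counter) -> float:
    """Shannon entropy (base 2) of the distribution given by raw counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def adjacent_entropies(
    sentences: Iterable[Sequence[str]], candidate: Sequence[str]
) -> Tuple[float, float]:
    """Return (left entropy, right entropy) of `candidate` in a segmented corpus.

    `sentences` is an iterable of token lists; `candidate` is a contiguous
    token sequence such as ("看", "看"). Sentence boundaries are counted as
    the placeholder neighbours "<s>" and "</s>" (an assumption of this sketch).
    """
    left, right = Counter(), Counter()
    cand = tuple(candidate)
    k = len(cand)
    for tokens in sentences:
        for i in range(len(tokens) - k + 1):
            if tuple(tokens[i:i + k]) == cand:
                left[tokens[i - 1] if i > 0 else "<s>"] += 1
                right[tokens[i + k] if i + k < len(tokens) else "</s>"] += 1
    return entropy(left), entropy(right)


if __name__ == "__main__":
    corpus = [
        ["我", "看", "看", "书"],
        ["你", "看", "看", "吧"],
        ["我", "看", "看", "书"],
    ]
    print(adjacent_entropies(corpus, ("看", "看")))
```

A candidate that appears in varied left and right contexts gets higher entropies, which is commonly taken as evidence that it behaves as an independent unit rather than as a fragment of a longer expression.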

Description

Technical field

[0001] The invention relates to natural language processing within the field of artificial intelligence and computing, in particular to a method for automatically acquiring Chinese reduplicated words by using natural language processing.

Background

[0002] In a large number of natural language applications, there is a basic and common problem: for a corpus composed of short texts (hereinafter referred to as a short text corpus, or simply a corpus), how to cluster the short texts into different classes according to a certain similarity.

[0003] Reduplicated words are a special linguistic phenomenon in Chinese. They are words formed by repeating two or more Chinese characters with the same form and meaning. Reduplicated words are used more and more widely in natural language, and new ones keep appearing, which poses further challenges for natural language processing. For example, in the field of automatic proofreading ...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventors: 刘亮亮, 吴健康, 马健
Owner: JIANGSU UNIV OF SCI & TECH