Discriminant text clustering method and system based on minimum normalized information distance

A discriminant text clustering technology, applied to discriminant text clustering methods and systems, that addresses problems such as underfitted models and poor clustering models and overcomes overfitting.

Pending Publication Date: 2020-04-03
UNIV OF SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

[0004] 1. An unreasonable choice of the initial model order leads to unsatisfactory clustering results. If the order is too large, the algorithm tends to produce an overfitted model, that is, "a model in which highly similar examples that should belong to one cluster are further subdivided into different clusters"; in the extreme case every training example becomes its own cluster, and such a clustering result is meaningless. If the order is too small, the algorithm tends to produce an underfitted model, that is, "a model in which examples that are not sufficiently similar are not separated".
[0005] 2. When the number of examples differs greatly across the potential clusters, the discriminative clustering algorithm based on maximum mutual information tends to produce a poor clustering model, that is, "a model that divides highly similar data into different clusters".
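
One plausible reading of problem 2 can be checked numerically. The sketch below is an illustration, not the patent's own analysis: it assumes examples are weighted uniformly and uses the standard decomposition I(X;Y) = H(Y) - H(Y|X). For fully confident (hard) assignments the mutual information equals the entropy of the cluster sizes, so a 50/50 partition scores higher than a 90/10 partition, which gives a maximum-mutual-information objective an incentive to split a large cluster. The function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(P):
    """I(X;Y) = H(Y) - H(Y|X), where row i of P is p(y | x_i).

    Assumption: every example x_i has the same weight 1/N.
    """
    P = np.asarray(P, dtype=float)
    h_y = entropy(P.mean(axis=0))                         # entropy of cluster sizes
    h_y_given_x = float(np.mean([entropy(r) for r in P])) # mean per-example entropy
    return h_y - h_y_given_x

def hard_assignments(sizes):
    """One-hot assignment matrix with sizes[j] examples in cluster j."""
    k = len(sizes)
    return np.vstack([np.tile(np.eye(k)[j], (n, 1)) for j, n in enumerate(sizes)])

balanced = mutual_information(hard_assignments([50, 50]))
imbalanced = mutual_information(hard_assignments([90, 10]))
print(f"balanced: {balanced:.3f} nats, imbalanced: {imbalanced:.3f} nats")
```

If the true clusters really are 90/10, the correct partition is therefore not the one this objective scores highest among two-cluster partitions.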


Detailed description of embodiments

[0052] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

[0053] To address the model selection problem in existing discriminant clustering algorithms, the present invention uses a normalized information measure as the objective function. This gives the algorithm automatic model selection capability and thereby improves its ability to obtain good clustering results when the manually selected initial model order is unreasonable.

[0054] Data clustering divides a collection of data objects into multiple classes, or clusters, such that the similarity between data objects within a cluster is higher than their similarity to objects in other clusters. In text processing, customer grouping and image segmentation a...

Abstract

The invention discloses a discriminant text clustering method and system based on minimum normalized information distance. The method comprises the following steps: vectorizing a text data set that comprises a plurality of texts, each text comprising a plurality of keywords; initializing a model parameter set for the vectorized text data set; calculating and updating the parameter set by gradient descent so as to minimize the normalized information distance; setting a termination condition and outputting the final parameter set; and constructing a discriminant text clustering algorithm from the final parameter set to perform text clustering. To address the model selection problem of existing discriminant clustering algorithms, the invention uses a normalized information measure as the objective function, which gives the algorithm automatic model selection capability and improves its ability to obtain good clustering results when the manually selected initial model order is unreasonable.
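
The steps in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented algorithm: the text available here does not define the normalized information distance or the model, so the sketch assumes a softmax model p(y|x) and the distance H(Y|X)/H(Y) = 1 - I(X;Y)/H(Y), uses bag-of-words counts for vectorization, and replaces the analytic gradient with a finite-difference one plus step halving. All function names and parameters are hypothetical.

```python
import numpy as np

def vectorize(texts):
    """Bag-of-words counts: one row per text, one column per vocabulary word."""
    vocab = sorted({w for t in texts for w in t.split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.split():
            X[i, index[w]] += 1.0
    return X, vocab

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))

def objective(W, X):
    """ASSUMED distance H(Y|X) / H(Y); the patent's definition may differ."""
    P = softmax(X @ W)
    h_y = entropy(P.mean(axis=0))
    h_y_given_x = float(np.mean([entropy(row) for row in P]))
    return h_y_given_x / (h_y + 1e-12)

def numerical_grad(f, W, h=1e-5):
    """Central finite differences; stands in for an analytic gradient."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        up = f(W)
        W[idx] = old - h
        down = f(W)
        W[idx] = old
        G[idx] = (up - down) / (2 * h)
    return G

def fit(X, k, lr=1.0, tol=1e-6, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], k))  # initialize parameter set
    f = lambda W_: objective(W_, X)
    history = [f(W)]
    for _ in range(max_iter):                       # termination: iteration cap
        G = numerical_grad(f, W)
        step = lr
        while step > 1e-8 and f(W - step * G) > history[-1]:
            step *= 0.5                             # halve until no increase
        if step <= 1e-8:
            break                                   # no descent step found
        W = W - step * G
        history.append(f(W))
        if history[-2] - history[-1] < tol:         # termination: small change
            break
    return W, history

texts = ["cat dog pet", "dog pet cat", "stock market trade", "market trade stock"]
X, vocab = vectorize(texts)
W, history = fit(X, k=3)
labels = softmax(X @ W).argmax(axis=1)  # cluster = most probable label
print("distance:", history[0], "->", history[-1], "labels:", labels)
```

Because each step is accepted only when it does not increase the objective, the recorded distance never rises between iterations; texts with identical word counts always receive the same label.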

Description

Technical field

[0001] The invention relates to the fields of natural language processing and text mining, and in particular to a discriminative text clustering method and system based on minimum normalized information distance.

Background

[0002] Existing text clustering mostly uses the K-means algorithm, while discriminative clustering algorithms mostly maximize mutual information (or a variant of it). With these methods the model order (the number of clusters, such as K in K-means) tends to remain equal to its initial value, so the algorithms lack automatic model selection and the model order of the final clustering result is largely determined by the user. However, it is difficult for a person to choose the most reasonable model order for text clustering, and an order that is too large or too small easily leads to poor clustering results.

[0003] The existing discriminative clustering algorithm based on maximum ...
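
The background's point that K-means keeps the model order equal to its initial value can be seen on a toy example. The sketch below is a plain one-dimensional Lloyd-style K-means written for illustration, with a deliberately naive initialization (the first K points); the data and the function name are hypothetical. The data contain two obvious groups, yet with K = 4 the result still has four clusters and the tight group near zero is split three ways.

```python
import numpy as np

def kmeans_1d(points, k, max_iter=100):
    """Lloyd's algorithm on 1-D data, initialized with the first k points."""
    points = np.asarray(points, dtype=float)
    centroids = points[:k].copy()
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        dists = np.abs(points[:, None] - centroids[None, :])
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Move each non-empty centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean()
    return labels, centroids

# Two well-separated groups: four points near 0 and four points near 10.
data = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
labels, centroids = kmeans_1d(data, k=4)
print("clusters used:", len(np.unique(labels)), "labels:", labels)
```

Nothing in the K-means objective rewards merging the three small clusters near zero, which is the lack of automatic model selection described above.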

Application Information

IPC(8): G06F16/35
CPC: G06F16/35
Inventor: 秦家虎, 朱英达, 付维明
Owner: UNIV OF SCI & TECH OF CHINA