Multi-document subject discovery method based on two-layer clustering

A multi-document topic discovery technology, applied in special data processing applications, instruments, electronic digital data processing, etc. It addresses problems such as the non-uniform distribution of words in the semantic space, the inconvenience of multi-document topic discovery, and the resulting negative impact on sentence clustering, so as to reduce the dimensionality of the vector space, highlight the similarity of topic content, and improve computing speed.

Inactive Publication Date: 2015-07-15
SOUTH CHINA UNIV OF TECH +2

AI Technical Summary

Problems solved by technology

Words are usually not uniformly distributed in the semantic space, so the "skew" between feature components in the traditional vector space model has a negative impact on sentence clustering.
In density-based sentence clustering algorithms, the radius parameter generally has to be specified in advance, which also makes multi-document topic discovery inconvenient.
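
One way to picture the first problem is the following minimal sketch, which assumes plain word-frequency vectors and cosine similarity. The sentences and the helper names `tf_vector` and `cosine` are illustrative assumptions, not taken from the patent. Because every distinct word is its own feature component, two sentences that say the same thing with synonyms share no component at all.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Word-frequency vector: one feature component per distinct word."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already segmented sentences (nouns and verbs only).
s1 = ["car", "buy"]
s2 = ["automobile", "purchase"]  # same meaning as s1, different words
s3 = ["car", "purchase"]         # shares one surface word with s1

# Synonyms occupy separate components, so the paraphrase looks unrelated.
sim_paraphrase = cosine(tf_vector(s1), tf_vector(s2))
sim_overlap = cosine(tf_vector(s1), tf_vector(s3))
print(sim_paraphrase, sim_overlap)
```

In this toy case the paraphrase pair scores lower than the pair that merely shares one surface word, which is the kind of distortion a sentence clusterer would inherit.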




Embodiment

[0041] As shown in Figure 1, the multi-document topic discovery method based on two-layer clustering in this embodiment includes the following steps:

[0042] S1. Taking multiple documents as input, preprocess each document: split the document into sentences and the sentences into words, obtain the noun set and the verb set of the multi-document collection, and perform word sense disambiguation on the polysemous words among them. The specific method of word sense disambiguation is:

[0043] The word segmentation result is first tagged with parts of speech, and only the noun set and the verb set are considered. For a polysemous word w, its senses are first obtained from the semantic dictionary; then, for each sense, the sum of its word meaning similarities with the k words of the same part of speech before and after w is calculated.
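
The scoring described in [0043] can be sketched as follows. This is only an illustration under stated assumptions: the similarity function `toy_sim` is a hand-written lookup table standing in for the dictionary-based word meaning similarity, the helper names are hypothetical, and choosing the sense with the largest sum is an assumed selection rule that the text above does not state.

```python
def same_pos_context(tagged, index, k):
    """Up to k words before and k words after position `index`
    that carry the same part-of-speech tag as the target word."""
    target_pos = tagged[index][1]
    before = [w for w, p in tagged[:index] if p == target_pos]
    after = [w for w, p in tagged[index + 1:] if p == target_pos]
    return before[max(0, len(before) - k):] + after[:k]


def sense_scores(senses, context, sim):
    """For each candidate sense, sum its similarity to every context word."""
    return {s: sum(sim(s, c) for c in context) for s in senses}


def pick_sense(senses, context, sim):
    """Assumed selection rule: keep the sense with the largest sum."""
    scores = sense_scores(senses, context, sim)
    return max(scores, key=scores.get)


# Hypothetical similarity table; a real system would query a semantic dictionary.
TOY_TABLE = {
    ("bank#finance", "money"): 0.8, ("bank#finance", "loan"): 0.9,
    ("bank#shore", "money"): 0.1, ("bank#shore", "loan"): 0.1,
}


def toy_sim(sense, word):
    return TOY_TABLE.get((sense, word), 0.0)


tagged = [("deposit", "v"), ("money", "n"), ("bank", "n"),
          ("loan", "n"), ("river", "n")]
context = same_pos_context(tagged, index=2, k=1)
scores = sense_scores(["bank#finance", "bank#shore"], context, toy_sim)
chosen = pick_sense(["bank#finance", "bank#shore"], context, toy_sim)
print(context, scores, chosen)
```

With k = 1, only the nearest noun on each side of the target contributes, so the verb "deposit" and the more distant noun "river" are ignored.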

[0044] The calculation method of the above-mentioned word meaning similarity is:

[0045] S11. For the word meaning similarity of the Chinese co...



Abstract

The invention discloses a multi-document subject discovery method based on two-layer clustering. The method comprises the following steps: S1, taking a plurality of documents as input and preprocessing each document, i.e. splitting the documents into sentences and the sentences into words, so as to obtain the noun set and the verb set of the multi-document collection, and performing semantic disambiguation on the polysemous words in the noun set and the verb set; S2, performing word clustering analysis separately on the noun set and the verb set output by step S1, according to word similarity and using an improved OPTICS algorithm, extracting semantic concepts, and establishing vector space models of the sentences according to the semantic concepts; S3, performing clustering analysis on the sentences using an improved K-medoids algorithm, so as to obtain the subjects. The method extracts the inner semantic relations between words and solves the problem of non-orthogonality among feature items when the feature vectors of the sentences are established.
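
Steps S2 and S3 can be pictured with the minimal sketch below. It rests on several assumptions: the word-to-concept mapping is written by hand instead of being produced by the improved OPTICS word clustering, the sentence clustering is a plain alternating K-medoids rather than the improved variant (whose details are not given here), cosine distance is an assumed metric, and all names are hypothetical.

```python
import math


def concept_vector(tokens, word_to_concept, n_concepts):
    """Sentence vector with one component per semantic concept (word cluster)."""
    vec = [0.0] * n_concepts
    for t in tokens:
        c = word_to_concept.get(t)
        if c is not None:
            vec[c] += 1.0
    return vec


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def assign(vectors, medoids):
    """Label each sentence with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: cosine_distance(v, vectors[m]))
            for v in vectors]


def k_medoids(vectors, initial_medoids, max_iter=100):
    """Plain alternating K-medoids: assign, then re-pick each cluster's medoid."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(vectors, medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i, lab in enumerate(labels) if lab == m]
            if not members:  # keep a medoid that lost all its members
                new_medoids.append(m)
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new_medoids.append(min(
                members,
                key=lambda c: sum(cosine_distance(vectors[c], vectors[j])
                                  for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(vectors, medoids)


# Hypothetical concepts: 0 = vehicle, 1 = acquire, 2 = weather, 3 = water.
word_to_concept = {"car": 0, "automobile": 0, "buy": 1, "purchase": 1,
                   "rain": 2, "storm": 2, "flood": 3, "river": 3}
sentences = [["car", "buy"], ["automobile", "purchase"],
             ["rain", "flood"], ["storm", "river"]]
vectors = [concept_vector(s, word_to_concept, 4) for s in sentences]
medoids, labels = k_medoids(vectors, initial_medoids=[0, 2])
print(medoids, labels)
```

Because synonyms fall into the same concept, the first two sentences receive identical vectors even though they share no surface word, and the vector length equals the number of concepts rather than the vocabulary size.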

Description

Technical field

[0001] The present invention relates to the research field of two-layer clustering, in particular to a method for discovering multi-document topics based on two-layer clustering.

Background technique

[0002] In terms of sentence representation for multi-document topic discovery, general technologies mainly use sentence segmentation, and use word frequency vectors or TF-IDF vectors based on word segmentation results to represent sentences. Normally, the distribution of words in the semantic space is not uniform. In this way, the "skew" between the feature components in the traditional vector space model will have a negative impact on sentence clustering. In the density-based sentence clustering algorithm, the general radius parameter needs to be specified in advance, which also brings inconvenience to multi-document topic discovery.

Summary of the invention

[0003] The main purpose of the present invention is to overcome the shortcomings and deficiencies of the p...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
Inventors: 陈健, 袁慎溪
Owner: SOUTH CHINA UNIV OF TECH