Multi-document subject discovery method based on two-layer clustering

A multi-document topic discovery method, applied in special data processing applications, instruments, electronic digital data processing, etc. It addresses the uneven distribution of words in the semantic space, the inconvenience this causes for multi-document topic discovery, and the resulting negative impact on sentence clustering, so as to reduce the spatial dimension, highlight the similarity of topic content, and improve computing speed.

Inactive Publication Date: 2015-07-15

SOUTH CHINA UNIV OF TECH +2

6 Cites, 18 Cited by


AI Technical Summary

Problems solved by technology

The distribution of words in the semantic space is usually not uniform, so the "skew" between feature components in the traditional vector space model has a negative impact on sentence clustering.

Density-based sentence clustering algorithms generally require the radius parameter to be specified in advance, which makes multi-document topic discovery inconvenient.
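
To illustrate why a fixed radius is awkward, the following minimal sketch groups one-dimensional points with a simplified density-style rule: a point joins a cluster when it lies within `eps` of its neighbor, which corresponds to DBSCAN with a minimum of one point per cluster. The function name `group_by_radius` and the sample data are hypothetical, and this is not the improved OPTICS algorithm of the patent.

```python
def group_by_radius(points, eps):
    """Chain sorted 1-D points into clusters; a gap larger than eps starts a new cluster."""
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


# Hypothetical data: one tightly packed group and one loosely packed group.
points = [0.0, 0.1, 0.2, 5.0, 6.0, 7.0]

tight = group_by_radius(points, eps=0.5)
loose = group_by_radius(points, eps=1.5)
```

With `eps=0.5` the loosely packed group falls apart into single points, while `eps=1.5` recovers it as one cluster. Which radius is appropriate cannot be known before the data has been inspected.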

Embodiment

[0041] As shown in Figure 1, the multi-document topic discovery method based on two-layer clustering in this embodiment includes the following steps:

[0042] S1. Take multiple documents as input and preprocess each document: split the document into sentences, segment each sentence into words, obtain the noun set and the verb set of the multi-document collection, and disambiguate the polysemous words among them. The specific word sense disambiguation method is as follows:

[0043] Tag the word segmentation result with parts of speech, keeping only the noun set and the verb set. For a polysemous word w, first look up its senses in the semantic dictionary, and then, for each sense, calculate the sum of its word sense similarity with the k words of the same part of speech before and after w.

[0044] The word sense similarity mentioned above is calculated as follows:
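
The scoring described in [0043] can be sketched as follows. This is a minimal illustration under stated assumptions: `senses_of` stands in for the semantic dictionary, `sense_similarity` stands in for the word sense similarity defined in the following paragraphs, and "k words before and after" is read as up to k on each side. Choosing the sense with the largest sum is also an assumption, since the excerpt does not state the selection rule.

```python
def disambiguate(tokens, index, senses_of, sense_similarity, k=2):
    """Score each sense of tokens[index] by its summed similarity to nearby same-POS words.

    tokens: list of (word, pos) pairs for one sentence.
    senses_of: hypothetical dictionary lookup, word -> list of sense labels.
    sense_similarity: hypothetical function (sense, context_word) -> float.
    Returns (best_sense, scores).
    """
    word, pos = tokens[index]
    # Collect up to k words with the same part of speech on each side of the target.
    same_pos_before = [w for w, p in tokens[:index] if p == pos]
    before = same_pos_before[max(0, len(same_pos_before) - k):]
    after = [w for w, p in tokens[index + 1:] if p == pos][:k]
    context = before + after
    scores = {
        sense: sum(sense_similarity(sense, c) for c in context)
        for sense in senses_of[word]
    }
    # Assumption: the sense with the largest summed similarity is selected.
    best = max(scores, key=scores.get)
    return best, scores


# Toy data, invented for illustration only.
senses_of = {"bank": ["bank/finance", "bank/river"]}
toy_similarity = {
    ("bank/finance", "money"): 0.9, ("bank/finance", "loan"): 0.8,
    ("bank/river", "money"): 0.1, ("bank/river", "loan"): 0.1,
}


def sense_similarity(sense, word):
    return toy_similarity.get((sense, word), 0.0)


tokens = [("money", "n"), ("deposit", "v"), ("bank", "n"), ("approve", "v"), ("loan", "n")]
best, scores = disambiguate(tokens, 2, senses_of, sense_similarity, k=2)
```

Only the nouns `money` and `loan` enter the context for the noun `bank`; the verbs are skipped because they have a different part of speech.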

[0045] S11. For the word meaning similarity of the Chinese co...


Abstract

The invention discloses a multi-document subject discovery method based on two-layer clustering. The method comprises the following steps: S1, taking a plurality of documents as input and preprocessing each document, i.e. splitting the documents into clauses and the clauses into words, so as to obtain a noun group and a verb group for the multi-document collection, and performing semantic disambiguation on the polysemous words in the noun group and the verb group; S2, performing word clustering analysis separately on the noun group and the verb group output by step S1 according to word similarity, using an improved OPTICS algorithm, extracting semantic concepts, and building vector space models of the clauses from the semantic concepts; S3, performing clustering analysis on the clauses with an improved K-medoids algorithm to obtain the subjects. The method extracts the inner semantic relations between words and solves the problem of non-orthogonality among feature items when the feature vectors of the clauses are built.
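
Steps S2 and S3 can be sketched as follows, under stated assumptions. The mapping `concept_of` is a hypothetical stand-in for the word clusters that the OPTICS step would produce, clauses are represented as counts over those concepts, and a plain alternating K-medoids with cosine distance stands in for the improved variant, whose details are not given in this excerpt. All names and data are invented for illustration.

```python
import math


def concept_vector(words, concept_of, n_concepts):
    """Count how often each semantic concept occurs in a segmented clause."""
    vec = [0.0] * n_concepts
    for w in words:
        if w in concept_of:
            vec[concept_of[w]] += 1.0
    return vec


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def assign(vectors, medoids):
    """Label each vector with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: cosine_distance(v, vectors[medoids[c]]))
        for v in vectors
    ]


def k_medoids(vectors, initial_medoids, max_iter=20):
    """Plain alternating K-medoids; returns (medoid indices, label per vector)."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(vectors, medoids)
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # The new medoid minimizes the total distance to its cluster members.
            new_medoids.append(min(
                members,
                key=lambda i: sum(cosine_distance(vectors[i], vectors[j]) for j in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(vectors, medoids)


# Hypothetical word clusters: concept ids 0-1 for driving, 2-3 for food.
concept_of = {"car": 0, "automobile": 0, "drive": 1, "steer": 1,
              "apple": 2, "fruit": 2, "eat": 3, "taste": 3}
clauses = [["car", "drive"], ["automobile", "steer"], ["apple", "eat"], ["fruit", "taste"]]
vectors = [concept_vector(c, concept_of, 4) for c in clauses]
medoids, labels = k_medoids(vectors, initial_medoids=[0, 2])
```

The first two clauses share no words, yet they map to the same concept vector and land in the same cluster, which is the effect the concept-level representation is meant to achieve.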

Description

Technical field

[0001] The present invention relates to the field of two-layer clustering research, and in particular to a multi-document topic discovery method based on two-layer clustering.

Background art

[0002] For sentence representation in multi-document topic discovery, existing techniques mainly segment sentences into words and represent each sentence with a word frequency vector or a TF-IDF vector built from the segmentation results. The distribution of words in the semantic space is usually not uniform, so the "skew" between feature components in the traditional vector space model has a negative impact on sentence clustering. In addition, density-based sentence clustering algorithms generally require the radius parameter to be specified in advance, which makes multi-document topic discovery inconvenient.

Summary of the invention

[0003] The main purpose of the present invention is to overcome the shortcomings and deficiencies of the p...
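
The limitation of word-level vectors described in the background can be illustrated with a small sketch. It uses one common TF-IDF formulation (term count times the logarithm of sentence count over sentence frequency); the description does not say which variant the prior art uses, so the formula, the function names and the sample sentences are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return the sorted vocabulary and one TF-IDF vector per segmented sentence."""
    vocab = sorted({w for s in sentences for w in s})
    n = len(sentences)
    # Sentence frequency: the number of sentences that contain each word.
    df = Counter(w for s in sentences for w in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical segmented sentences: two about driving, one about food.
sentences = [
    ["car", "drive", "road"],
    ["automobile", "steer", "road"],
    ["apple", "eat"],
]
vocab, vectors = tfidf_vectors(sentences)
same_topic = cosine(vectors[0], vectors[1])
different_topic = cosine(vectors[0], vectors[2])
```

The two driving sentences get a low similarity that comes only from the shared word `road`. The pairs `car`/`automobile` and `drive`/`steer` contribute nothing, because every word is treated as its own independent axis even when the words are semantically related.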


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30, G06F17/27

Inventor: 陈健, 袁慎溪

Owner: SOUTH CHINA UNIV OF TECH
