Professional field-oriented on-line theme detection method

A detection method oriented to professional fields, applied in topic detection and tracking. It addresses the difficulty of clustering text in large-scale data sets, the long running time of the AP clustering algorithm, and its high algorithmic complexity, so as to improve clustering accuracy, avoid system performance degradation, and increase clustering speed.

Active Publication Date: 2017-08-18
TIANJIN UNIV
Cites: 2; Cited by: 15

AI Technical Summary

Problems solved by technology

[0005] The disadvantage of the AP algorithm is its high complexity: the CPU time it needs grows quadratically with the number of data points. Therefore, when the data set is relatively large (N > 3000), the AP clustering algorithm often takes a long time to run.
In recent years, with the substantial increase in the number of web pages, the AP clustering algorithm has been unable to meet the clustering requirements of large-scale data sets.
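
To make the cost concrete, the following is a minimal sketch of textbook affinity propagation message passing, not the optimized method of this patent. It keeps two N x N matrices (responsibilities and availabilities) and revisits every entry in each iteration, which is where the quadratic cost comes from. The function name, the `damping` and `iterations` defaults, and the toy data are illustrative assumptions.

```python
def affinity_propagation(s, preference, damping=0.5, iterations=200):
    """Textbook affinity propagation on an n x n similarity matrix (n >= 2).

    Returns (exemplars, labels): exemplar indices and, for every item,
    the index of the exemplar it is assigned to.
    """
    n = len(s)
    # Copy s and place the preference on the diagonal.
    s = [[preference if i == k else s[i][k] for k in range(n)] for i in range(n)]
    r = [[0.0] * n for _ in range(n)]  # responsibilities, N x N
    a = [[0.0] * n for _ in range(n)]  # availabilities, N x N

    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            first = scores[best]
            second = max(scores[k] for k in range(n) if k != best)
            for k in range(n):
                new = s[i][k] - (second if k == best else first)
                r[i][k] = damping * r[i][k] + (1.0 - damping) * new
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        # a(k,k) = sum of positive r(i',k), i' != k
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1.0 - damping) * new

    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [], [-1] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy data (hypothetical): two well-separated groups on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
sim = [[-(x - y) ** 2 for y in points] for x in points]
exemplars, labels = affinity_propagation(sim, preference=-81.0)
```

Each iteration touches all N² entries of both matrices, so memory and per-iteration time both scale as N². This is consistent with the long running times described above for N > 3000.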

Examples

Embodiment Construction

[0035] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0036] As shown in Figure 1, the flow chart of the algorithm of the present invention comprises:

[0037] Step 1. Preprocessing: First, perform text preprocessing, including word segmentation, stop word removal, calculation of TF-IDF values, vectorization, standardization and other preprocessing operations, to obtain the text vector matrix of the text set, and extract the dictionary from the text set;
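
The preprocessing of Step 1 can be sketched roughly as follows. This is an illustration only: the stop-word list, the regular-expression tokenizer (standing in for a real word segmenter), and the `log(N/df) + 1` IDF variant are assumptions, since the source does not specify them.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a domain-specific one.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_matrix(texts):
    """Return (dictionary, matrix); each row is an L2-normalized TF-IDF vector."""
    docs = [tokenize(t) for t in texts]
    dictionary = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(dictionary)}
    n_docs = len(docs)
    # Document frequency: number of texts that contain each word.
    df = Counter(w for doc in docs for w in set(doc))

    matrix = []
    for doc in docs:
        row = [0.0] * len(dictionary)
        for w, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[w]) + 1.0  # assumed IDF variant
            row[index[w]] = tf * idf
        # Standardization step: scale each text vector to unit length.
        norm = math.sqrt(sum(v * v for v in row))
        if norm > 0:
            row = [v / norm for v in row]
        matrix.append(row)
    return dictionary, matrix


dictionary, matrix = tfidf_matrix(["the cat sat", "the dog sat", "a cat and a dog"])
```

The returned `dictionary` plays the role of the dictionary extracted from the text set, and `matrix` the text vector matrix.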

[0038] Step 2. Topic decomposition: Decompose the preprocessed text set according to the LDA model to obtain the latent topic structure;

[0039] Step 3. Calculate $p(\theta_k \mid d)$ and $p(\omega \mid \theta_k)$: the mixing weight $p(\theta_k \mid d)$ of text $d$ on topic $\theta_k$ and the frequency of occurrence $p(\omega \mid \theta_k)$ of the feature word $\omega$ in topic $\theta_k$ are random variables, and the control parameters, the topic-word distribution $\varphi$ and the text-topic distribution $\theta$, are introduced to estimate p(θ ...
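
For orientation, in the standard LDA formulation the two quantities of Step 3 combine as a mixture, and with a collapsed Gibbs sampler they are commonly estimated from count statistics. The symbols $n_{d,k}$ (words of text $d$ assigned to topic $k$), $n_{k,\omega}$ (occurrences of word $\omega$ assigned to topic $k$), the symmetric priors $\alpha$ and $\beta$, the topic count $K$ and the dictionary size $V$ are assumptions taken from the textbook formulation, not from the truncated source text:

```latex
p(\omega \mid d) = \sum_{k=1}^{K} p(\omega \mid \theta_k)\, p(\theta_k \mid d)

p(\theta_k \mid d) \approx \frac{n_{d,k} + \alpha}{\sum_{k'=1}^{K} n_{d,k'} + K\alpha},
\qquad
p(\omega \mid \theta_k) \approx \frac{n_{k,\omega} + \beta}{\sum_{\omega'=1}^{V} n_{k,\omega'} + V\beta}
```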

Abstract

The invention discloses a professional field-oriented on-line theme detection method. The method comprises the following steps: obtaining a text vector matrix of a preprocessed text set, and extracting a dictionary from the text set; modeling the text vector matrix; calculating the mixing weight $p(\theta_k \mid d)$ of a text $d$ on a topic $\theta_k$ and the frequency $p(\omega \mid \theta_k)$ with which a feature word appears in each topic $\theta_k$; obtaining the similarity between two texts $d_i$ and $d_j$ by defining the topic-model-based distance between texts as the relative entropy distance between their text vectors, and calculating a similarity matrix; compressing the text set to obtain a new text sample set; calculating a similarity matrix of the new text sample set, and selecting a preference parameter p according to the similarity matrix; combining the clustering results to generate a new clustering result; calculating the distances between all texts in the original text set and the compressed, classified texts, and performing classification; and outputting the topics of the text set and the final clustering result. Compared with the prior art, the method improves both the accuracy and the efficiency of clustering by adopting an optimized clustering algorithm.
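
The similarity step can be illustrated with a short sketch. It assumes each text is represented by its topic distribution $p(\theta_k \mid d)$. It also assumes that the relative entropy is symmetrized and that the preference is taken as the median off-diagonal similarity, a common default for AP. The abstract states neither choice, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy KL(p || q); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized relative entropy (an assumption; plain KL is asymmetric)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def similarity_matrix(doc_topic):
    """Similarity = negative topic distance; the diagonal is left at 0.0."""
    n = len(doc_topic)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s[i][j] = s[j][i] = -symmetric_kl(doc_topic[i], doc_topic[j])
    return s


def median_preference(s):
    """Median of the off-diagonal similarities (assumed selection rule)."""
    n = len(s)
    values = sorted(s[i][j] for i in range(n) for j in range(n) if i != j)
    mid = len(values) // 2
    if len(values) % 2:
        return values[mid]
    return 0.5 * (values[mid - 1] + values[mid])


# Hypothetical topic distributions for three texts over two topics.
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
s = similarity_matrix(theta)
p = median_preference(s)
```

In this toy example the first two texts have similar topic mixtures, so `s[0][1]` is closer to zero than `s[0][2]`. A larger preference generally yields more clusters and a smaller one fewer.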

Description

Technical field

[0001] The invention belongs to the technical fields of data mining, natural language processing, information extraction and information retrieval, and in particular relates to topic detection and tracking technology.

Background technique

[0002] Currently, among the technologies related to topic detection, the commonly used clustering algorithms are mainly the K-means clustering algorithm (K-means) and the affinity propagation clustering algorithm (AP algorithm). K-means is the most popular and typical distance-based partitioning clustering algorithm. It uses distance as the measure of similarity and regards a cluster as a set of objects that are close to one another, so its final goal is to obtain compact and independent clusters. The K-means algorithm uses randomly selected points as the initial center points, and then divides the points in the set into corresponding catego...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06K9/62
CPC: G06F16/355, G06F18/23213, G06F18/24137
Inventor: 喻梅, 原旭莹, 于健, 高洁, 王建荣, 辛伟
Owner: TIANJIN UNIV