Incremental clustering algorithm based on community detection

An incremental clustering technique based on community detection, applied in the field of text clustering. It addresses the high computational time complexity of existing methods and their inability to distinguish hot events from continuously reported events, thereby reducing computation time overhead.

Active Publication Date: 2020-04-10
EAST CHINA NORMAL UNIV +1

AI Technical Summary

Problems solved by technology

[0004] 1. The time complexity of existing incremental clustering algorithms remains high.
[0005] DenStream adopts an Online-Offline two-stage clustering framework whose Merging and Pruning stages have high computational complexity, which incurs a large time overhead. C-DenStream adds event-level Must-Link and Cannot-Link constraints on top of DenStream, which improves the clustering results but inherits DenStream's high computational complexity. PreDeConStream improves performance in the Offline stage, but still incurs a large time overhead when searching for the nearest neighbor class.
[0006] 2. Existing algorithms cannot distinguish hot events from continuously reported events.
[0007] DenStream directly deletes the data in the Outlier-Micro Cluster during the Pruning stage, so low-frequency hot events and low-frequency continuous reports are deleted together, which risks information loss. C-DenStream uses a semi-supervised method to classify news at the event level, but it still cannot distinguish hot news from continuous news within the same event category. PreDeConStream does not handle this case, so it also lacks the ability to distinguish the two.
[0008] In summary, existing incremental clustering algorithms have high computational time complexity and cannot distinguish hot events from continuously reported events, and no incremental text clustering algorithm that addresses these problems has been reported.

Examples

Embodiment 1

[0041] Referring to Figure 1, the incremental clustering algorithm based on community detection is carried out according to the following steps:

[0042] S1: Pre-train word vectors on the full Chinese financial text corpus to generate a word vector model. The full corpus is collected by crawlers that periodically crawl major financial portal websites; the word vector model is trained on this corpus with fastText.

[0043] S2: Apply a Bloom filter to the full Chinese financial text corpus to screen out duplicate texts, then preprocess the remaining texts to obtain the target financial corpus. Preprocessing consists of stop-word removal and word segmentation with THULAC.
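The de-duplication in S2 can be illustrated with a minimal Bloom filter sketch. The source names only the technique; the bit-array size, the number of hash functions, the SHA-256 double-hashing scheme, and the whitespace normalisation below are all assumptions chosen for illustration, and `BloomFilter` and `deduplicate` are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        # Both sizes are illustrative assumptions, not values from the source.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, text: str):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, text: str) -> None:
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(text))


def deduplicate(texts):
    """Keep the first occurrence of each text and drop later repeats."""
    seen = BloomFilter()
    unique = []
    for text in texts:
        key = " ".join(text.split())  # normalise whitespace before hashing
        if key in seen:
            continue  # probably seen before (false positives are possible)
        seen.add(key)
        unique.append(text)
    return unique
```

A Bloom filter never reports a stored text as unseen, but it can occasionally report an unseen text as already stored, so a small fraction of distinct articles may be dropped. The bit-array size and hash count control that trade-off between memory and accuracy.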

[0044] S3: Apply TF-IDF to the target financial corpus to obtain the top-k keywords of each corpus docu...
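As a rough illustration of the TF-IDF keyword step in S3, the sketch below scores the tokens of each already-segmented document and keeps the k highest. The source does not state which TF-IDF variant is used; the smoothed IDF formula, the alphabetical tie-breaking, and the name `top_k_keywords` are assumptions.

```python
import math
from collections import Counter


def top_k_keywords(docs, k=3):
    """docs: list of token lists. Returns the k highest-scoring tokens per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    results = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        # Assumed variant: relative term frequency times a smoothed IDF.
        scores = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results
```

Terms that occur in every document, such as a generic word like "bank" in a financial corpus, receive the lowest IDF and therefore fall to the bottom of each ranking.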


Abstract

The invention discloses an incremental clustering algorithm based on community detection. The algorithm adopts the Community concept and an Online-Offline two-stage framework, and introduces the IMC concept. A similarity graph of the target corpus is obtained by a similarity calculation over each document's word-vector representation, its representative keywords, and its named-entity-recognition predictions. The similarity graph is then processed with the Louvain algorithm to obtain an initial community result, on which an incremental clustering algorithm is run to obtain the final clustering result. Compared with the prior art, the method reduces computation time overhead under the same hardware conditions and generates clustering results quickly, so that upstream and downstream services of the application scenario are better served and can respond in a timely manner. It also distinguishes hot events from continuously reported events, enabling effective clustering and event-level filtering of news events.
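The pipeline in the abstract (similarity graph, initial communities, incremental assignment) can be sketched roughly as follows. This is not the patented method: the weighted mix in `similarity`, the 0.5 threshold, and the average-similarity assignment rule are assumptions, and connected components are used as a simple stand-in for the Louvain algorithm to keep the example short. The IMC concept is not modelled because the available text does not define it.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity(d1, d2, weights=(0.5, 0.25, 0.25)):
    """Hypothetical weighted mix of the three signals named in the abstract."""
    w_vec, w_kw, w_ent = weights
    return (w_vec * cosine(d1["vector"], d2["vector"])
            + w_kw * jaccard(d1["keywords"], d2["keywords"])
            + w_ent * jaccard(d1["entities"], d2["entities"]))


def build_similarity_graph(docs, threshold=0.5):
    """Return {(i, j): weight} for pairs whose similarity reaches the threshold."""
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = similarity(docs[i], docs[j])
            if s >= threshold:
                edges[(i, j)] = s
    return edges


def connected_components(n, edges):
    """Stand-in for Louvain: group nodes joined by any retained edge (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def assign_incrementally(new_doc, docs, communities, threshold=0.5):
    """Append new_doc to docs and place it in the best-matching community,
    or open a new community if no average similarity reaches the threshold."""
    docs.append(new_doc)
    new_id = len(docs) - 1
    best, best_score = None, threshold
    for community in communities:
        score = sum(similarity(new_doc, docs[m]) for m in community) / len(community)
        if score >= best_score:
            best, best_score = community, score
    if best is None:
        communities.append([new_id])
    else:
        best.append(new_id)
    return communities
```

The point of the incremental step is that a newly arrived document is compared only against existing communities, so historical documents are not re-clustered. In practice the initial grouping would come from a real Louvain implementation rather than connected components.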

Description

Technical field

[0001] The invention relates to the technical field of text clustering, in particular to an incremental clustering algorithm based on community detection.

Background

[0002] News is an important source of information, and a news report often contains specific information, such as a report on a particular company or person. Many technology companies and researchers are therefore committed to mining valuable information from relevant news reports to support commercial information analysis or data mining. Clustering is an effective means of gathering related information into topic clusters. With the explosive growth of information, traditional clustering methods hit serious performance bottlenecks on large-scale data, and they re-cluster the historical data whenever new data arrives, causing unnecessary performance overhead. Compared with traditional clustering methods, incremental clustering is more suitable for this...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F16/33, G06F16/9535, G06F40/295
CPC: G06F16/35, G06F16/3344, G06F16/9535, Y02D10/00
Inventors: 杨佳乐, 程大伟, 罗轶凤, 钱卫宁, 周傲英
Owner: EAST CHINA NORMAL UNIV