Class center compression transformation-based text clustering method in search engine

A text clustering technology based on compression transformation, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems such as inaccurate clustering results and effects on document similarity.

Active Publication Date: 2013-03-06
珠海市颢腾智胜科技有限公司

AI Technical Summary

Problems solved by technology

Therefore, the calculation of document similarity is



Embodiment Construction

[0053] The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0054] The text clustering method based on class center compression transformation of the present invention mines the latent semantic associations between text words, calculates word centers, and compresses the class centers to improve the accuracy of text clustering. It calculates the similarity between each class center and each text, iteratively splits and merges classes, and reorganizes the class centers until a given criterion is met. To mine the latent semantic associations between words, the similarity between texts is calculated with an improved tf-idf and used as an important index of the degree of association between words. In addition, the title of each document is extracted and segmented into words, and the title vocabulary is given extra weight in the similarity calculation.
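The assignment and class center update described in [0054] can be illustrated with a minimal sketch. It assumes cosine similarity over sparse word-weight vectors and a plain average for the class center; the text does not confirm either choice, and the function names `cosine`, `assign`, and `recompute_centers` are hypothetical. The split, merge, and compression steps are not shown.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign(docs, centers):
    """Assign each document to the index of its most similar class center."""
    return [max(range(len(centers)), key=lambda k: cosine(d, centers[k]))
            for d in docs]

def recompute_centers(docs, labels, n_centers):
    """Assumed update rule: average the word weights of each class's documents."""
    sums = [defaultdict(float) for _ in range(n_centers)]
    counts = [0] * n_centers
    for d, k in zip(docs, labels):
        counts[k] += 1
        for t, w in d.items():
            sums[k][t] += w
    return [{t: w / counts[k] for t, w in sums[k].items()} if counts[k] else {}
            for k in range(n_centers)]

# Toy data (illustrative only): two documents about search, one about sport.
docs = [
    {"search": 1.0, "engine": 1.0},
    {"search": 1.0, "index": 1.0},
    {"football": 1.0, "match": 1.0},
]
centers = [{"search": 1.0}, {"match": 1.0}]
labels = assign(docs, centers)
centers = recompute_centers(docs, labels, len(centers))
```

In the described method this assignment would be repeated after each compression, split, and merge pass until the stopping criterion is met.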

[0055] $tf_{new} = \log(tf) + 1$

[0056] Where fileNum is the total number ...
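As a rough illustration of the dampened term frequency in [0055], the sketch below computes `log(tf) + 1` for each word. The idf part of the formula is truncated in the text above, so the factor `log(fileNum / df)` used here is the textbook form and only an assumption, as are the natural logarithm and the function names.

```python
import math
from collections import Counter

def tf_new(tf):
    """Dampened term frequency from [0055]: log(tf) + 1, defined for tf >= 1."""
    return math.log(tf) + 1.0

def word_weights(docs):
    """Weight each word of each tokenized document.

    ASSUMPTION: idf = log(fileNum / df), the standard form; the idf actually
    used by the method is truncated in the source text.
    """
    file_num = len(docs)                                  # total number of files
    df = Counter(t for doc in docs for t in set(doc))     # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf_new(c) * math.log(file_num / df[t])
                        for t, c in tf.items()})
    return weights

# Toy data (illustrative only).
docs = [["search", "search", "engine"], ["search", "index"]]
w = word_weights(docs)
```

With this assumed idf, a word that appears in every file (here "search") receives weight 0, while the logarithm keeps repeated occurrences of a word within one file from dominating its weight.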


Abstract

The invention discloses a class center compression transformation-based text clustering method in a search engine. The method comprises the following steps: using an improved tf-idf formula, calculating the word weights of each file in a text set; calculating initial class centers; mining a synonym word set and a co-occurring high-frequency word set; calculating word centers and performing a primary classification according to the similarity of the initial class centers with each file; compressing the class centers according to information such as title words, article length, synonyms and co-occurring associated words, thereby guaranteeing that the same word occurs only in those class centers that have high similarity with the word; clustering the files again using the new cluster centers; calculating the core similarity of each class; splitting the biggest class; combining smaller classes to produce a new class; iterating the compression, clustering and split operations until the number of classes converges; and guaranteeing that the similarity of each text with the cluster center of its class reaches a certain threshold value. The clustering accuracy is markedly higher than that of conventional methods such as KMeans and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
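The compression step can be pictured with the following sketch, which keeps each word only in the class centers where it scores highest. This is a simplification under stated assumptions: the word's weight inside a center stands in for the word-to-center similarity, and the roles of title words, article length, synonyms and co-occurring words are ignored. The function `compress_centers` and its `keep` parameter are hypothetical.

```python
def compress_centers(centers, keep=1):
    """Keep each word only in the `keep` class centers where it weighs most.

    ASSUMPTION: a word's weight in a center approximates its similarity to
    that class; the method described above uses richer signals.
    """
    words = {t for c in centers for t in c}
    compressed = [dict() for _ in centers]
    for t in words:
        # Rank centers by this word's weight, highest first (ties keep index order).
        ranked = sorted(range(len(centers)),
                        key=lambda k: centers[k].get(t, 0.0), reverse=True)
        for k in ranked[:keep]:
            if centers[k].get(t, 0.0) > 0.0:
                compressed[k][t] = centers[k][t]
    return compressed

# Toy data (illustrative only): "search" and "news" each appear in both centers.
centers = [
    {"search": 0.9, "engine": 0.4, "news": 0.1},
    {"news": 0.8, "search": 0.2},
]
compressed = compress_centers(centers, keep=1)
```

After compression each shared word survives only in the center where it is strongest, which makes the centers more distinct before the files are clustered again.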

Description

Technical field

[0001] The invention belongs to the technical field of text mining and machine learning, and particularly relates to a text clustering method based on class center compression transformation in a search engine. The method combines synonymous phrases, co-occurring associated phrases, word centers, class centers, title content, document length and other factors, and repeatedly clusters and splits the text set to improve clustering accuracy. It is suitable for search engines and information retrieval systems.

Background art

[0002] In the real world, text is the most important carrier of information; research shows that 80% of information is contained in text documents. On the Internet in particular, text data exists widely in various forms, such as news reports, e-books, research papers, digital libraries, web pages, emails, and so on. Text clustering technology can be applied to information filtering and person...


Application Information

IPC(8): G06F17/30
Inventor: 欧阳元新, 谢舒翼, 刘文琦, 熊璋
Owner: 珠海市颢腾智胜科技有限公司