Class center compression transformation-based text clustering method in search engine

A text clustering technology based on class center compression transformation, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses problems such as distorted document similarity calculations and the resulting inaccurate clustering.

Active Publication Date: 2013-03-06

珠海市颢腾智胜科技有限公司

Cites: 6 | Cited by: 19

Problems solved by technology

The calculation of document similarity is distorted, which leads to inaccurate clustering results.

Embodiment Construction

[0053] The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0054] The text clustering method of the present invention, based on class center compression transformation, mines the latent semantic associations between the words of the texts, calculates word centers, and compresses the class centers to improve clustering accuracy. It calculates the similarity between each class center and each text, then iteratively splits and merges classes and rebuilds the class centers until a stopping criterion is met. To mine the latent semantic associations between words, the similarity between texts is calculated with an improved tf-idf and used as an important measure of the degree of association between words. In addition, the title of each document is extracted and segmented into words, and title words are given extra weight in the similarity calculation.
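The similarity and title-weighting steps in the paragraph above can be illustrated with a minimal sketch. The source text shown here does not give the similarity measure or the title weight, so the use of cosine similarity, the `title_boost` factor of 2.0, and all function names are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {word: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def boost_title_words(doc_vec, title_words, title_boost=2.0):
    """Return a copy of doc_vec with title words scaled up.

    title_boost is a hypothetical factor, not a value taken from the patent.
    """
    return {t: (w * title_boost if t in title_words else w)
            for t, w in doc_vec.items()}

def nearest_center(doc_vec, centers):
    """Index of the class center most similar to the document."""
    sims = [cosine(doc_vec, c) for c in centers]
    return max(range(len(centers)), key=sims.__getitem__)

# Toy data: two class centers and one document whose title contains "cluster".
centers = [{"search": 1.0}, {"cluster": 1.0}]
doc = {"search": 1.0, "cluster": 1.0}
boosted = boost_title_words(doc, {"cluster"})
best = nearest_center(boosted, centers)
```

In this toy case the unweighted document is equally similar to both centers, and weighting the title word tips the assignment toward the second center.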

[0055] $tf_{new} = \log(tf) + 1$

[0056] Where fileNum is the total number ...
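As a rough illustration of the damped term frequency in [0055], the sketch below computes log(tf) + 1 for each word of one tokenized document. The base of the logarithm is not stated in the text shown here, so the natural logarithm is an assumption, as is the function name `damped_tf`. The idf part of the improved formula is truncated in this excerpt and is therefore not reproduced.

```python
import math
from collections import Counter

def damped_tf(tokens):
    """Map each word to log(tf) + 1, where tf is its raw count in the document.

    Natural log is assumed; the source does not specify the base.
    """
    counts = Counter(tokens)
    return {word: math.log(count) + 1.0 for word, count in counts.items()}

weights = damped_tf(["search", "engine", "search", "cluster"])
```

A word that occurs once gets weight 1, and repeated words grow only logarithmically, so very frequent words do not dominate the document vector.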

Abstract

The invention discloses a text clustering method for search engines based on class center compression transformation. The method comprises the following steps: using an improved tf-idf formula, calculating the word weights of each file in a text set; calculating initial class centers; mining a synonym set and a co-occurring high-frequency word set; calculating word centers and performing a preliminary classification according to the similarity between the initial class centers and each file; compressing the words of the class centers according to information such as title words, article length, synonyms and co-occurrence associated words, so that a given word appears only in the class centers with which it has high similarity; clustering the files again using the new class centers; calculating the core similarity of each class; splitting the largest class; merging smaller classes to produce a new class; and iterating the compression, clustering and splitting operations until the number of classes converges, so that the similarity between each text and the center of its class reaches a certain threshold. The clustering accuracy is significantly higher than that of conventional methods such as KMeans and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
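The compression step in the abstract, where a word is kept only in the class centers it is most strongly associated with, might be sketched as follows. This is only one possible reading: the abstract does not define how the similarity between a word and a class center is measured, so using the word's weight inside each center as a proxy, the `keep` parameter, and the function name `compress_centers` are all assumptions. The title, article length, synonym and co-occurrence information mentioned in the abstract is not modeled here.

```python
def compress_centers(centers, keep=1):
    """For every word, retain it only in the `keep` centers where its weight is largest.

    centers is a list of {word: weight} dicts; a new list is returned and the
    input is left unchanged. Using the in-center weight as the word-to-center
    similarity is an assumption made for this sketch.
    """
    vocab = set().union(*centers)  # all words that appear in any center
    compressed = [dict() for _ in centers]
    for word in vocab:
        ranked = sorted(range(len(centers)),
                        key=lambda i: centers[i].get(word, 0.0),
                        reverse=True)
        for i in ranked[:keep]:
            if word in centers[i]:
                compressed[i][word] = centers[i][word]
    return compressed

centers = [{"apple": 0.9, "bank": 0.2}, {"bank": 0.8, "apple": 0.1}]
compact = compress_centers(centers, keep=1)
```

With `keep=1`, each word survives only in the single center where it weighs most, which makes the centers more distinct from one another before the next clustering pass.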

Description

Technical field

[0001] The invention belongs to the technical field of text mining and machine learning research, and particularly relates to a text clustering method based on class center compression transformation in a search engine. By combining synonymous phrases, co-occurrence associated phrases, vocabulary centers, class centers, title content, document length and other factors, the method repeatedly clusters and splits the text set in an iterative manner to improve clustering accuracy. The method is suitable for search engines and information retrieval systems.

Background technique

[0002] In the real world, text is the most important carrier of information; in fact, research shows that 80% of information is contained in text documents. Especially on the Internet, text data widely exists in various forms, such as news reports, e-books, research papers, digital libraries, web pages, emails, and so on. Text clustering technology can be applied to information filtering and person...

Application Information

IPC(8): G06F17/30

Inventor: 欧阳元新, 谢舒翼, 刘文琦, 熊璋

Owner: 珠海市颢腾智胜科技有限公司
