Quick clustering method for massive text data

A clustering method for text data, applied in the field of text clustering. It addresses the long processing time of large text collections, the lack of combined clustering and clustering-effect evaluation for such data, and the inability of general clustering algorithms to meet these needs. Its effect is to select the best-performing clustering algorithm strategy.

Pending Publication Date: 2020-04-24
成都迪普曼林信息技术有限公司 +1
6 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0002] Text clustering groups similar documents on the premise that documents in the same class are highly similar while documents in different classes are not. At present, most clustering methods focus on improving clustering accuracy. In terms of efficiency, they take a long time to process text data numbering in the thousands or tens of thousands, and the accuracy of general clustering algorithms cannot meet the clustering requirements of such a large volume of text data. Different clustering requirements also place different weight on clustering time. Therefore, when evaluating a clustering algorithm, both clustering accuracy and clustering time can serve as observation points for judging the clustering effect, so that an efficient clustering algorithm can be selected according to the clustering requirements. Currently, clustering and clustering-effect evaluation are lacking for text data at the scale of thousands or tens of thousands of documents.
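The idea of using accuracy and time together as observation points can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the source does not say which accuracy measure is used, so purity stands in for it, and `evaluate` and `toy_cluster` are hypothetical helpers rather than part of the described method.

```python
import time
from collections import Counter


def purity(labels_pred, labels_true):
    """Fraction of documents that fall in the majority true class of their cluster.

    Purity is only one possible external accuracy measure (an assumption here).
    """
    clusters = {}
    for pred, true in zip(labels_pred, labels_true):
        clusters.setdefault(pred, []).append(true)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)


def evaluate(cluster_fn, docs, labels_true):
    """Run a clustering function once and report both accuracy and elapsed time."""
    start = time.perf_counter()
    labels_pred = cluster_fn(docs)
    elapsed = time.perf_counter() - start
    return {"purity": purity(labels_pred, labels_true), "seconds": elapsed}


def toy_cluster(docs):
    """Hypothetical stand-in for a real algorithm: cluster by first word."""
    return [doc.split()[0] for doc in docs]


docs = ["cat sat", "cat ran", "dog barked", "dog slept"]
truth = ["class-a", "class-a", "class-b", "class-b"]
report = evaluate(toy_cluster, docs, truth)
```

Running `evaluate` over several candidate algorithms on the same data would give one accuracy/time pair per algorithm, which is the kind of comparison the paragraph above calls for.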


Examples


Embodiment Construction

[0032] To give a clearer understanding of the technical features, purposes and effects of the present invention, specific embodiments of the invention are now described with reference to the accompanying drawings.

[0033] The fast clustering method for massive text data is suitable for execution on computer equipment. The command-line parameters supplied through the external interface and the text information read from a specified directory are preprocessed. A preset structure is then called through the internal interface to cluster the text data in the specified directory. The clustering results are output to the specified directory as an EXCEL file or through a graphical interface, and the clustering effect is evaluated.

[0034] The command-line parameters include the clustering algorithm, the word vector encoding method, the text distance measure and the evaluation method. The clustering algorithms include K-means, single-pass clustering, hierarchical ...
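The four command-line parameters listed above could be parsed along the following lines. This is a sketch under stated assumptions, not the patent's actual interface: all option names, defaults and choice strings are hypothetical, the algorithm choices cover only the three named before the source text is cut off, the encoding choices are taken from the techniques named in the abstract, and the default distance measure is a guess.

```python
import argparse


def build_parser():
    """Build a parser for the four parameter groups named in the description.

    All flag names and defaults below are illustrative assumptions.
    """
    parser = argparse.ArgumentParser(
        description="Cluster the text files in a directory (illustrative sketch)"
    )
    parser.add_argument("directory", help="directory containing the text files")
    parser.add_argument(
        "--algorithm",
        choices=["kmeans", "single-pass", "hierarchical"],  # source list is truncated
        default="kmeans",
    )
    parser.add_argument(
        "--encoding",
        choices=["simhash", "word2vec", "bert"],  # assumed from the abstract
        default="simhash",
    )
    parser.add_argument(
        "--distance",
        default="cosine",  # the source does not name the supported measures
        help="text distance measure",
    )
    parser.add_argument(
        "--evaluation",
        choices=["internal", "external"],
        default="internal",
    )
    return parser


# Example invocation with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(
    ["./docs", "--algorithm", "single-pass", "--encoding", "word2vec"]
)
```

Restricting each option with `choices` makes an unsupported algorithm or encoding fail at parse time, before any text is read from the directory.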


Abstract

The invention provides a fast clustering method for massive text data. The method preprocesses externally input command-line parameters and text information read from a specified directory, calls a preset structure through an internal interface to cluster the text data in the specified directory, outputs the clustering result to the specified directory as an EXCEL file or through a graphical interface, and evaluates the clustering effect. Specifically, the method comprises text data reading, text information preprocessing, text data clustering and clustering result output. The text information preprocessing comprises: S1, performing word segmentation on Chinese documents and tokenization on English documents; S2, removing stop words; S3, calculating a simhash code for each document after stop-word removal; S4, performing word embedding with word2vec vectors and calculating a document vector for each document after stop-word removal; S5, performing word embedding with BERT vectors to obtain word vectors. Through internal or external evaluation, the method selects the best-performing clustering algorithm strategy.
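Steps S1 to S3 can be sketched for English text as follows. This is a minimal illustration, not the patent's implementation: the stop-word list, the regular-expression tokenizer, the 64-bit width and the use of MD5 as the per-token hash are all assumptions, and Chinese word segmentation (which needs a dedicated segmenter) is not shown.

```python
import hashlib
import re

# Tiny illustrative stop-word list (an assumption, not the patent's list).
STOP_WORDS = {"the", "a", "of", "and", "is", "for"}


def tokenize(text):
    """S1 (English only): lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """S2: drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def simhash(tokens, bits=64):
    """S3: compute a simhash fingerprint from unweighted token hashes."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def preprocess(text):
    """Chain S1 to S3 for a single English document."""
    return simhash(remove_stop_words(tokenize(text)))


code_a = preprocess("The clustering of text")
code_b = preprocess("clustering text")
```

Because the two example sentences differ only in stop words, they produce the same fingerprint. Documents with mostly shared vocabulary tend to end up a small Hamming distance apart, which is what makes simhash codes cheap to compare at scale.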

Description

Technical field

[0001] The invention relates to the field of text clustering, and in particular to a fast clustering method for massive text data.

Background

[0002] Text clustering groups similar documents on the premise that documents in the same class are highly similar while documents in different classes are not. At present, most clustering methods focus on improving clustering accuracy. In terms of efficiency, they take a long time to process text data numbering in the thousands or tens of thousands, and the accuracy of general clustering algorithms cannot meet the clustering requirements of such a large volume of text data. Different clustering requirements also place different weight on clustering time. Therefore, when evaluating a clustering algorithm, both clustering accuracy and clustering time can serve as observation points for judging the clustering effect ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35
CPC: G06F16/35
Inventor: 陈泽勇, 张治同, 李志强, 姚松, 张莉
Owner: 成都迪普曼林信息技术有限公司