Spark-based multi-feature combined efficient Chinese text clustering method

A clustering method using multi-feature technology, applied in the field of machine learning. It addresses the problems of ignored semantic similarity, increased computational and time complexity, and loss of semantic information, and it achieves a good text clustering effect while reducing complexity, computing cost, and time cost.

Active Publication Date: 2018-01-16
NANJING UNIV OF SCI & TECH
2 Cites, 15 Cited by

AI Technical Summary

Problems solved by technology

[0008] (1) High-dimensional sparseness: Current text clustering algorithms are all based on the vector space model (VSM). Although the model is simple, representing each text as a vector produces high-dimensional vectors, which increases computational and time complexity.
[0009] (2) Loss of semantic information and overly simple clustering features: When text similarity is computed from TF-IDF weights, the semantic similarity between words is not considered, so the clustering effect is poor.
[0010] (3) Long running time and high space usage: Most current algorithms run on a single machine, which takes a long time to process data and has low computational efficiency.
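
Problems (1) and (2) can be seen on a toy example. The sketch below is only an illustration, not the patent's implementation: it uses raw term frequencies instead of TF-IDF weights for brevity, and the three pre-segmented documents are hypothetical. Every text becomes a vector as long as the whole vocabulary, and two texts that use synonyms but share no words get a cosine similarity of zero.

```python
import math
from collections import Counter


def tf_vector(tokens, vocabulary):
    """Term-frequency vector of a tokenized text over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents, already split into words.
doc_a = ["car", "engine", "repair"]
doc_b = ["automobile", "motor", "maintenance"]  # same topic, different words
doc_c = ["car", "engine", "price"]

# The vector dimension equals the vocabulary size, so it grows with the corpus.
vocabulary = sorted(set(doc_a) | set(doc_b) | set(doc_c))
vec_a = tf_vector(doc_a, vocabulary)
vec_b = tf_vector(doc_b, vocabulary)
vec_c = tf_vector(doc_c, vocabulary)

sim_ab = cosine(vec_a, vec_b)  # synonyms only, no shared words
sim_ac = cosine(vec_a, vec_c)  # two shared words
zero_share = vec_a.count(0) / len(vec_a)  # fraction of zero entries in vec_a

print(len(vocabulary), sim_ab, round(sim_ac, 3), round(zero_share, 3))
```

Even with three short documents, more than half of the entries in `vec_a` are zero, and `doc_a` and `doc_b` look unrelated to a word-frequency measure although they describe the same topic.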

Examples

Embodiment

[0085] With reference to figure 1, the specific implementation steps of the Spark-based multi-feature combined efficient Chinese text clustering method include:

[0086] Step 1: Build the Spark platform and the HDFS file system on physical servers;

[0087] Step 2: Upload the original text data set to the HDFS file system, perform parallel word segmentation on it using the ICTCLAS Chinese word segmentation system and the Hadoop parallel computing platform, and upload the segmented data set back to the HDFS file system;

[0088] Step 3: The Spark platform reads the segmented data set from the HDFS file system, converts it into a resilient distributed dataset (RDD), and starts a number of concurrent threads according to the number of RDD partitions set in the user program; the threads read the data and store it in system memory;

[0089] Step 4: According to the interdependence between the partitions in the RDD, the Spark job scheduling system splits the wri...
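
The idea in Step 3, one reader per partition with the results held in memory, can be mimicked in plain Python. The sketch below is a single-machine stand-in and does not use the Spark API: the helper names (`split_into_partitions`, `load_partition`, `load_in_memory`), the round-robin split, and the sample lines are all assumptions made for illustration, with each line standing for one segmented text whose words are separated by spaces.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(lines, num_partitions):
    """Deal the lines round-robin into num_partitions lists."""
    return [lines[i::num_partitions] for i in range(num_partitions)]


def load_partition(partition):
    """Parse one partition: each line is a space-separated, segmented text."""
    return [line.split() for line in partition]


def load_in_memory(lines, num_partitions):
    """Start one worker thread per partition and keep the parsed data in memory."""
    partitions = split_into_partitions(lines, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # pool.map returns results in the same order as the partitions.
        return list(pool.map(load_partition, partitions))


# Hypothetical output of the word segmentation in Step 2.
segmented_lines = [
    "文本 聚类 方法",
    "机器 学习",
    "分布式 计算 平台",
    "语义 相似度",
]

cached = load_in_memory(segmented_lines, num_partitions=2)
for index, partition in enumerate(cached):
    print(index, partition)
```

In a real deployment the partitions live on different cluster nodes and Spark schedules the work; the sketch only shows how the partition count chosen by the user program determines the degree of parallelism.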

Abstract

The invention discloses a Spark-based multi-feature combined efficient Chinese text clustering method. The method comprises the following steps: uploading massive data sets to an HDFS file system, taking advantage of its high fault tolerance and high data access throughput, preprocessing the data, and submitting the data sets to a Spark cluster; after preprocessing of the text set is completed, separately calculating a semantic similarity and a word-frequency-statistics-based cosine similarity for the dimensionality-reduced texts, combining the two similarities to obtain a final text similarity, and carrying out text clustering using the obtained text similarity together with a maximum distance method. By combining semantic information with word frequency statistics, the method makes the text similarity calculation more accurate and at the same time greatly reduces the number of iterations.
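
The last two steps of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the abstract does not say how the two similarities are combined, so a linear weight `alpha` is assumed; the "maximum distance method" is assumed to mean max-min distance selection of initial cluster centers with distance taken as 1 minus similarity; and the two 4x4 similarity matrices are made-up values.

```python
def combined_similarity(semantic_sim, cosine_sim, alpha=0.5):
    """Assumed combination rule: weighted sum of the two similarities."""
    return alpha * semantic_sim + (1.0 - alpha) * cosine_sim


def pick_centers_max_distance(sim, k, first=0):
    """Max-min distance selection of k centers; distance is 1 - similarity."""
    centers = [first]  # arbitrary starting text
    while len(centers) < k:
        best, best_dist = None, -1.0
        for i in range(len(sim)):
            if i in centers:
                continue
            # Distance from text i to its closest already-chosen center.
            dist = min(1.0 - sim[i][c] for c in centers)
            if dist > best_dist:
                best, best_dist = i, dist
        centers.append(best)
    return centers


def assign_to_centers(sim, centers):
    """Label each text with the center it is most similar to."""
    return [max(centers, key=lambda c: sim[i][c]) for i in range(len(sim))]


# Hypothetical pairwise similarities for four texts.
semantic = [
    [1.0, 0.9, 0.2, 0.0],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.0, 0.2, 0.8, 1.0],
]
word_freq_cosine = [
    [1.0, 0.7, 0.0, 0.0],
    [0.7, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]

n = len(semantic)
final_sim = [
    [combined_similarity(semantic[i][j], word_freq_cosine[i][j]) for j in range(n)]
    for i in range(n)
]
centers = pick_centers_max_distance(final_sim, k=2)
labels = assign_to_centers(final_sim, centers)
print(centers, labels)
```

Choosing centers that are far apart from each other gives the subsequent clustering a spread-out starting point, which is one plausible reason the abstract reports fewer iterations.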

Description

Technical field

[0001] The invention belongs to the field of machine learning, and in particular relates to a Spark-based multi-feature combined efficient Chinese text clustering method.

Background technique

[0002] Clustering is a machine learning technique. It divides the original sample data set into several different categories based on the differences between the samples and on different parameters. The ultimate goal of clustering is therefore to make the differences between samples assigned to the same cluster small, while the differences between samples assigned to different clusters are large.

[0003] Text clustering is a kind of clustering based on the following principle: the differences between texts belonging to the same cluster are small, while the differences between texts belonging to different clusters are relatively large. Different from classification, clustering technology belongs to a class of un...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27, G06K9/62
Inventor: 蔡晨晓, 毕涛, 徐杨, 卜京, 姚娟, 殷明慧
Owner: NANJING UNIV OF SCI & TECH