Parallel k-means algorithm used for high-dimensional text data

A parallel k-means algorithm for text data, in the fields of electrical digital data processing, special data processing applications, computing, etc. It addresses the high randomness of clustering results and the unsuitability of distance as a similarity measure, with the effect of improving operation speed, reliability, portability, and accuracy.

Inactive Publication Date: 2018-11-06
DALIAN UNIV OF TECH
3 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

For high-dimensional text data, the classic k-means proposed by Lloyd selects the initial cluster seeds at random, which makes the clustering results too random. In addition, the algorithm measures data similarity by distance during clustering, but distance is not a suitable similarity measure for high-dimensional document data.
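The second problem can be illustrated with a small sketch. The source does not say which measure replaces distance, so cosine similarity is used here purely as an assumed, commonly used alternative; the three count vectors are made-up examples, not data from the patent.

```python
import math

def euclidean(u, v):
    # Straight-line distance between two count vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Angle-based similarity; an assumed alternative, not named in the source.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical bag-of-words vectors over a six-word vocabulary.
short_doc = [1, 1, 0, 0, 0, 0]
long_doc = [3, 3, 0, 0, 0, 0]   # same words as short_doc, three times as often
other_doc = [0, 0, 1, 1, 0, 0]  # shares no words with short_doc

d_same_topic = euclidean(short_doc, long_doc)
d_unrelated = euclidean(short_doc, other_doc)
s_same_topic = cosine_similarity(short_doc, long_doc)
s_unrelated = cosine_similarity(short_doc, other_doc)
```

By Euclidean distance, short_doc ends up closer to the unrelated document than to the longer document that uses exactly the same words, while the angle-based measure ranks them the other way round.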



Embodiment Construction

[0031] Embodiments of the present invention will be described in detail below.

[0032] A parallel k-means algorithm for high-dimensional text data, including the following steps:

[0033] The first step is to train high-dimensional text data to obtain low-dimensional data

[0034] 1.1) Select an appropriate text representation model to vectorize the text data. The present invention adopts the bag-of-words text representation model, which yields a matrix of size N*d, where N is the number of texts and d is the number of feature words in the whole corpus, i.e. the current dimension of each document. Figure 2 shows the process of vectorizing text with the bag-of-words model: first, read the content of the file line by line; then, use the nltk library in Python to tokenize the read content; after that, read the stop word list and remove the stop words from the file; finally, the feature wo...
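The vectorization steps described so far can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a regular expression stands in for the nltk tokenizer so the example has no dependencies, and `STOP_WORDS`, `tokenize`, `bag_of_words`, and the two sample documents are hypothetical.

```python
import re

# Hypothetical stop word list; the source reads one from a file instead.
STOP_WORDS = {"the", "a", "of", "and", "to", "is"}

def tokenize(line):
    # Stand-in for nltk tokenization: lowercase runs of ASCII letters.
    return re.findall(r"[a-z]+", line.lower())

def bag_of_words(docs, stop_words=STOP_WORDS):
    """Return (vocabulary, N*d count matrix) for a list of document strings."""
    tokenized = [[t for t in tokenize(doc) if t not in stop_words] for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for t in tokens:
            matrix[i][index[t]] += 1
    return vocab, matrix

# Made-up sample documents, one per line of an imagined input file.
docs = ["Take the tablet twice daily", "Check the blood pressure daily"]
vocab, X = bag_of_words(docs)
```

Here N is `len(X)` and d is `len(vocab)`; most entries of each row are zero, which is the sparsity the description attributes to short texts.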


Abstract

The invention belongs to the intersection of natural language processing and machine learning, and provides a parallel k-means algorithm for high-dimensional text data. The algorithm first vectorizes the text data, then builds a dimension reduction model for the vectorized matrix, converting the high-dimensional data into low-dimensional data with highly efficient features. An optimized k-means clustering algorithm, which further improves accuracy, then clusters the dimension-reduced data, and the algorithm is parallelized with GPU and MPI technology. Together these steps improve both the accuracy and the efficiency of clustering high-dimensional text data. The algorithm can greatly improve the accuracy of text clustering, as well as the running speed and portability of the algorithm.
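The abstract does not say how the clustering work is divided between processes. One common data-parallel decomposition, shown here only as an assumption, lets each worker compute per-cluster sums and counts for its own chunk of the dimension-reduced data and then combines them in a reduce step. The sketch below simulates that pattern in plain Python with ordinary Lloyd-style assignment; it uses no MPI or GPU calls, it does not reproduce the patent's optimized k-means, and all function names and data are hypothetical.

```python
def nearest(point, centroids):
    # Index of the centroid with the smallest squared Euclidean distance.
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def partial_stats(chunk, centroids):
    """Work one worker can do independently: per-cluster sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)
        counts[k] += 1
        for j, value in enumerate(point):
            sums[k][j] += value
    return sums, counts

def kmeans_step(chunks, centroids):
    """One iteration: combine per-worker statistics, then update centroids."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for chunk in chunks:  # each pass of this loop could run on a separate worker
        sums, counts = partial_stats(chunk, centroids)
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for j in range(dim):
                total_sums[k][j] += sums[k][j]
    # A cluster that received no points keeps its previous centroid.
    return [[s / total_counts[k] for s in total_sums[k]] if total_counts[k]
            else list(centroids[k])
            for k in range(len(centroids))]

# Made-up low-dimensional data, already split into two chunks.
chunks = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]]]
centroids = [[1.0, 1.0], [9.0, 9.0]]
new_centroids = kmeans_step(chunks, centroids)
```

Only the small sums and counts need to be exchanged between workers in this scheme, never the data points themselves, which is why the pattern maps naturally onto a collective reduction.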

Description

Technical field

[0001] The invention belongs to the intersection of natural language processing and machine learning, and relates to a parallel k-means algorithm (pk-means) for medical text data.

Background technique

[0002] The medical text data mentioned in the present invention refers to the medical orders issued by doctors. Medical orders are short texts, so the orders in the entire medical order corpus are high-dimensional and sparse. The present invention belongs to the preprocessing part of the overall project: it aims to use an efficient method to perform an initial clustering of all medical orders in preparation for the project's follow-up work. In the following, the term "text data" refers to "medical text data".

[0003] At this stage, clustering of text data can be divided into two steps: data preprocessing and clustering.

[0004] In terms of data preprocessing, the idea based on dimensionality reduction is mai...


Application Information

IPC(8): G06F17/30, G06K9/62, G06N3/04, G06N3/08
CPC: G06N3/084, G06N3/045, G06F18/23213
Inventor: 申彦明, 单晓磊, 王宇新
Owner: DALIAN UNIV OF TECH