Parallel k-means algorithm used for high-dimensional text data

A text data clustering technology, classified under electrical digital data processing, special data processing applications, and computing. It addresses the large randomness of clustering results and the unsuitability of distance as a similarity measure, with the effect of improving the running speed, reliability, portability, and accuracy of the algorithm.

Status: Inactive. Publication date: 2018-11-06

DALIAN UNIV OF TECH

Cites: 3. Cited by: 13.


AI Technical Summary

Problems solved by technology

For high-dimensional text data, the classic k-means algorithm proposed by Lloyd selects its initial cluster seeds at random, which makes the clustering results vary too much from run to run. In addition, the algorithm measures data similarity by distance during clustering, but distance is not a suitable similarity measure for high-dimensional document data.
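The second problem can be illustrated with a small sketch. It compares Euclidean distance with cosine similarity on made-up word-count vectors. Cosine similarity is used here only as one common alternative to distance; the source does not say which measure the invention adopts, and all names and values below are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word-count vectors over a 4-word vocabulary.
short_doc = [1, 1, 0, 0]    # short text using words 0 and 1
long_doc = [10, 10, 0, 0]   # same word mix, ten times longer
other_doc = [0, 1, 1, 0]    # different word mix, similar length

# Distance is dominated by document length, not by word mix.
print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
# Cosine similarity ignores length and compares direction only.
print(cosine_similarity(short_doc, long_doc), cosine_similarity(short_doc, other_doc))
```

In this toy case the Euclidean distance places the short document closer to the document with a different word mix, while cosine similarity ranks the document with the same word mix as the most similar.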


Embodiment Construction

[0031] Embodiments of the present invention will be described in detail below.

[0032] A parallel k-means algorithm for high-dimensional text data, including the following steps:

[0033] The first step is to train on the high-dimensional text data to obtain low-dimensional data.

[0034] 1.1) Select an appropriate text representation model to vectorize the text data. The present invention adopts the bag-of-words text representation model, which yields a matrix of size N*d, where N is the number of texts and d is the number of feature words in the whole corpus, that is, the current dimension of each document. Figure 2 shows the process of vectorizing text with the bag-of-words model: first, read the content of the file line by line; then, use the nltk library in Python to segment the content into words; next, read the stop word list and remove the stop words from the file; finally, the feature wo...
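The vectorization steps described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: a regular expression stands in for the nltk segmentation step so that the example runs without extra downloads, and the stop word list, the sample documents, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop word list; the source reads one from a file.
STOP_WORDS = {"the", "a", "of", "and", "to", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens
    (a simple stand-in for nltk-based segmentation)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(documents, stop_words=STOP_WORDS):
    """Return (vocabulary, matrix), where matrix has N rows and d columns."""
    token_lists = [
        [t for t in tokenize(doc) if t not in stop_words]
        for doc in documents
    ]
    # The feature words of the whole corpus define the d dimensions.
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        matrix.append(row)
    return vocabulary, matrix

# Made-up short texts, one per line of a hypothetical input file.
docs = [
    "aspirin 100 mg daily",
    "check blood pressure daily",
    "aspirin and blood test",
]
vocabulary, matrix = bag_of_words(docs)
print(len(matrix), len(vocabulary))  # N and d
```

Even in this tiny corpus most entries of each row are zero, which hints at why short texts produce high-dimensional, sparse matrices.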


Abstract

The invention belongs to the intersection of natural language processing and machine learning, and provides a parallel k-means algorithm for high-dimensional text data. The algorithm first vectorizes the text data, then builds a dimension reduction model on the vectorized matrix to convert the high-dimensional data into low-dimensional data with highly effective features. An optimized k-means clustering algorithm, which further improves accuracy, is then used to cluster the dimension-reduced data, and the algorithm is parallelized with GPU and MPI technology, so that both the accuracy and the efficiency of clustering high-dimensional text data are improved. The algorithm can greatly improve the accuracy of text clustering, as well as the running speed and portability of the algorithm.
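The abstract does not spell out how the clustering is parallelized. A common data-parallel pattern for k-means, shown here purely as an assumption, has each worker compute per-cluster partial sums over its share of the data and then combines those sums in a reduction. In the sketch below a plain loop over chunks stands in for the MPI ranks or GPU blocks, squared Euclidean distance is a placeholder for whatever measure the optimized algorithm uses, and all names and values are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )

def partial_stats(chunk, centroids):
    """Work done by one worker: per-cluster coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)
        counts[k] += 1
        for j, value in enumerate(point):
            sums[k][j] += value
    return sums, counts

def parallel_step(chunks, centroids):
    """One k-means iteration: combine the workers' partial results."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for chunk in chunks:  # each chunk stands in for one worker's data
        sums, counts = partial_stats(chunk, centroids)
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for j in range(dim):
                total_sums[k][j] += sums[k][j]
    # Empty clusters keep their previous centroid.
    return [
        [s / total_counts[k] for s in total_sums[k]]
        if total_counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]

# Made-up low-dimensional points split across two hypothetical workers.
chunks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 2.0], [10.0, 12.0]],
]
centroids = [[1.0, 1.0], [9.0, 9.0]]
new_centroids = parallel_step(chunks, centroids)
print(new_centroids)
```

Only the small per-cluster sums and counts need to be exchanged between workers, not the data points themselves, which is what makes this pattern a natural fit for a reduction across processes.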

Description

Technical field

[0001] The invention belongs to the intersection of natural language processing and machine learning, and relates to a parallel k-means algorithm (pk-means) for medical text data.

Background

[0002] The medical text data mentioned in the present invention refers to the medical orders issued by doctors. Medical orders are short texts, so the orders in the whole medical order corpus are high-dimensional and sparse. The present invention belongs to the preprocessing part of the whole project: it is intended to perform an efficient initial clustering of all medical orders in preparation for the project's follow-up work. In the following, the term "text data" refers to "medical text data".

[0003] At this stage, clustering of text data can be divided into two steps: data preprocessing and clustering.

[0004] In terms of data preprocessing, the idea based on dimensionality reduction is mai...


Application Information


IPC(8): G06F17/30, G06K9/62, G06N3/04, G06N3/08

CPC: G06N3/084, G06N3/045, G06F18/23213

Inventors: 申彦明, 单晓磊, 王宇新

Owner: DALIAN UNIV OF TECH
