Data dimension reduction method based on parallel principal component analysis (PCA) algorithm

A data dimensionality reduction technique based on principal component analysis, applied in fields such as computing, computer parts, and instruments. It addresses data sets that are too large to be loaded into memory at one time, reducing I/O operations and improving processing efficiency.

Inactive Publication Date: 2017-10-20
UNIV OF ELECTRONICS SCI & TECH OF CHINA
4 Cites, 16 Cited by

AI Technical Summary

Problems solved by technology

[0005] The present invention addresses deficiencies in the prior art by providing a principal component analysis method built on the MapReduce parallel computing framework. It solves the problem that a traditional stand-alone principal component analysis algorithm cannot load the data into memory at one time when the data scale is too large, and it reduces I/O operations to improve the processing efficiency of data dimensionality reduction.

Examples

Embodiment Construction

[0030] The technical solution of the present invention will be further described in detail below in conjunction with the accompanying drawings.

[0031] As shown in Figure 1, a data dimensionality reduction method based on a parallel principal component analysis algorithm includes the following steps:

[0032] S1: Construct the sample data matrix D_{n×m} from the data to be dimensionally reduced, arranged so that each row represents one sample, where m is the dimension of the sample data and n is the number of samples;

[0033] S2: According to the Hadoop cluster node environment used for data processing, divide the sample data matrix D_{n×m} horizontally into N blocks (ensuring that each block can be loaded into memory), that is, D = {D_1, D_2, ..., D_N}. Distribute the data blocks to N machines for processing; each machine calculates the square matrix and sum vector of the correspond...
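The description of S2 is truncated above, so the following is only a minimal sketch under stated assumptions, not the patented procedure. It assumes that the "square matrix" of a block D_i is the m×m product D_i^T D_i, that the "sum vector" is the vector of column sums of D_i, and that the covariance matrix uses an n - 1 divisor. The function names `block_stats` and `covariance_from_blocks` are hypothetical, and a plain Python loop stands in for the map and reduce phases of a Hadoop job.

```python
import numpy as np

def block_stats(block):
    """Map step (assumed form): per-block square matrix D_i^T D_i, column-sum vector, row count."""
    block = np.asarray(block, dtype=float)
    return block.T @ block, block.sum(axis=0), block.shape[0]

def covariance_from_blocks(blocks):
    """Reduce step (assumed form): combine per-block statistics into the m x m covariance matrix."""
    gram_total, sum_total, n = None, None, 0
    for block in blocks:
        gram, col_sum, rows = block_stats(block)
        gram_total = gram if gram_total is None else gram_total + gram
        sum_total = col_sum if sum_total is None else sum_total + col_sum
        n += rows
    mean = sum_total / n
    # C = (sum_i D_i^T D_i - n * mean mean^T) / (n - 1); the n - 1 divisor is an assumption.
    return (gram_total - n * np.outer(mean, mean)) / (n - 1)

# Illustrative data only: n = 1000 samples of dimension m = 5, split into N = 4 row blocks.
rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 5))
blocks = np.array_split(D, 4, axis=0)
C = covariance_from_blocks(blocks)
```

Each block contributes only an m×m matrix and a length-m vector, so the full n×m matrix never has to be held in memory on a single machine, which matches the motivation stated in the text.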

Abstract

The invention discloses a data dimension reduction method based on a parallel principal component analysis (PCA) algorithm. The method comprises the steps of: S1, constructing a sample data matrix D_{n×m} from the high-dimensional data whose dimensions are to be reduced; S2, calculating the covariance matrix C_{m×m} of the sample data matrix D_{n×m}; S3, calculating the m eigenvalues of the covariance matrix C_{m×m} and the m corresponding eigenvectors; S4, determining the number k of principal components according to the eigenvalues and eigenvectors; and S5, using the eigenvectors that correspond to the k largest eigenvalues to construct a transformation matrix, and using the transformation matrix to calculate a principal component matrix, which is the dimension-reduced data. The method overcomes the problem that a traditional stand-alone principal component analysis algorithm cannot load the data into memory at once when the data size is too large, reduces I/O operations, and improves the processing efficiency of data dimension reduction.
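Steps S3 to S5 of the abstract can be sketched on a single machine as follows. This is a hedged illustration, not the method as claimed: the abstract does not say how k is chosen, so a cumulative explained-variance threshold (`var_threshold`, a hypothetical parameter) is assumed, and centering the data before projection is also an assumption. The function name `pca_reduce` is hypothetical.

```python
import numpy as np

def pca_reduce(C, D, var_threshold=0.95):
    """Given covariance C (m x m) and data D (n x m), return (Y, W, k)."""
    # S3: eigen-decomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # S4 (assumed criterion): smallest k whose cumulative variance ratio reaches the threshold.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, var_threshold)) + 1, len(eigvals))

    # S5: transformation matrix from the top-k eigenvectors, then projection.
    W = eigvecs[:, :k]
    centered = D - D.mean(axis=0)               # centering is an assumption
    return centered @ W, W, k

# Illustrative data only: two high-variance columns and two near-constant ones.
rng = np.random.default_rng(0)
D = rng.normal(size=(500, 4)) * np.array([10.0, 5.0, 0.1, 0.1])
C = np.cov(D, rowvar=False)
Y, W, k = pca_reduce(C, D)
```

Here `Y` plays the role of the principal component matrix (n×k) and `W` the transformation matrix (m×k). In a distributed setting, `C` would come from the parallel covariance computation rather than from `np.cov`.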

Description

Technical field

[0001] The invention relates to linear dimensionality reduction technology for high-dimensional data, and in particular to a data dimensionality reduction method based on a principal component analysis algorithm.

Background

[0002] With the continuous development of network information technology and the mobile Internet, the amount of data in the different vertical business fields of enterprises is increasing. How to discover valuable information in these data and provide important decision-making support has become a key to enterprise success. These data often have two characteristics: large scale and high dimensionality. Large-scale high-dimensional data poses challenges to data transmission, storage, and pattern discovery, so processing such data efficiently and discovering patterns in it effectively is particularly important. In these high-dimensional data, the...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62
CPC: G06F18/2135
Inventor: 王勇, 杨晓东, 陈炬光, 杨晨, 张应福
Owner UNIV OF ELECTRONICS SCI & TECH OF CHINA