Canonical-correlation-analysis-based computer data attribute reduction method

A canonical correlation and data attribute technique, applied in the field of data processing, that addresses the failure of existing attribute reduction methods to consider the correlation between conditional attributes in an information table

Inactive Publication Date: 2016-09-14
NANJING UNIV
5 Cites, 6 Cited by

AI Technical Summary

Problems solved by technology

[0004] Purpose of the invention: existing attribute reduction methods do not consider the correlation between conditional attributes in an information table. To address this problem, the present invention proposes a simple computer data attribute reduction method based on canonical correlation analysis (CCA).

Examples

Embodiment 1

[0069] Each step of the present invention is described below with reference to an embodiment. The method is applicable to essentially any data used for classification. This embodiment takes ordinary text data as an example. Douban classifies a large number of books so that it can recommend books of a given category to users. Because classifying these books manually is practically infeasible, classifying them automatically by their text content is of great practical value. The main difficulty in text processing is that text data contains a very large number of words, so its dimensionality is high, in some cases reaching tens of thousands of dimensions. Such data usually also contains many useless attributes, which both reduce classification accuracy and make processing very time-consuming. It is therefore necessary to reduce the attributes of such data, ...

Embodiment 2

[0110] The second data set comes from two medical institutions. It contains diagnostic information for healthy people and for patients, and the goal is to distinguish the diagnostic data of healthy people from that of patients. All of the data are mass spectrometry data acquired with SELDI technology, which are then processed to obtain 10,000-dimensional features. These 10,000-dimensional features contain a great deal of redundant information. Classifying on them directly gives poor results, so the dimensionality must be reduced first.

[0111] In the canonical correlation analysis stage of step (1), the attribute set is again divided into two subsets, each with 5,000 attributes. Attribute correlation analysis is then performed on the two subsets. Because the attribute dimension is large, the fusion granularity is set somewhat larger here, to 100, 300, ...
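The partition and correlation stage described in this embodiment can be sketched as follows. This is a minimal illustration, not the patented procedure: the function names `split_views` and `canonical_correlations` are hypothetical, the canonical correlations are computed with a standard QR plus SVD approach, and the toy data uses 6 attributes rather than 10,000. The sketch assumes more samples than attributes per view and full column rank; the fusion granularity and the merging rule from the text are not modelled.

```python
import numpy as np


def split_views(X, n_views=2):
    """Equipartition the columns (attributes) of X into n_views sub-views."""
    return np.array_split(X, n_views, axis=1)


def canonical_correlations(A, B):
    """Canonical correlations between two views, in descending order.

    Assumes both views have full column rank and more rows than columns.
    """
    A = A - A.mean(axis=0)  # centre each attribute
    B = B - B.mean(axis=0)
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of each view's column space
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly above 1


# Toy data: 200 samples, 6 attributes; attribute 3 is a noisy copy of attribute 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=200)

views = split_views(X, n_views=2)
rho = canonical_correlations(views[0], views[1])
print(rho)  # the first value is close to 1 because of the duplicated attribute
```

With real data where each view has more attributes than there are samples, this plain formulation degenerates (all correlations approach 1), so a regularised or kernel variant would be needed.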

Abstract

The invention discloses a computer data attribute reduction method based on canonical correlation analysis. The method comprises: (1) dividing the original attribute set of an information table into several sub attribute sets by equipartition, each sub attribute set being treated as a sub-view of the original attribute set; (2) performing canonical correlation analysis on the views to obtain the correlations between view features; (3) merging the attributes in descending order of correlation and combining the sub-views into a single view, which yields a new attribute set; (4) calculating the attribute importance degree of each attribute in the new attribute set and sorting the attributes by importance degree in descending order; (5) selecting the attribute with the largest attribute importance degree and adding it to the reduction set; and (6) calculating the dependency degree of the reduction set: if it is close to that of the original attribute set, the reduction set is output; otherwise, step (5) is carried out again.
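Steps (4) to (6) can be sketched as a greedy loop. The abstract does not define the attribute importance degree or how "close" is measured, so this sketch makes two labelled assumptions: importance is taken as the drop in rough-set dependency degree when an attribute is removed (a common choice in the rough-set literature), and closeness is a tolerance parameter `tol`. The names `dependency` and `reduce_attributes` are hypothetical.

```python
from collections import defaultdict


def dependency(rows, labels, attrs):
    """Fraction of objects whose equivalence class under attrs has one label."""
    seen_labels = defaultdict(set)
    counts = defaultdict(int)
    for row, y in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        seen_labels[key].add(y)
        counts[key] += 1
    consistent = sum(n for key, n in counts.items() if len(seen_labels[key]) == 1)
    return consistent / len(rows)


def reduce_attributes(rows, labels, tol=0.0):
    """Greedy reduction: add attributes by importance until dependency is close."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    # Step (4), ASSUMED definition: importance = dependency lost without the attribute.
    importance = {
        a: target - dependency(rows, labels, [b for b in all_attrs if b != a])
        for a in all_attrs
    }
    ranked = sorted(all_attrs, key=lambda a: importance[a], reverse=True)
    reduct = []
    for a in ranked:  # step (5): take the next most important attribute
        reduct.append(a)
        # Step (6): stop once the reduct's dependency is within tol of the target.
        if target - dependency(rows, labels, reduct) <= tol:
            break
    return reduct


# Toy information table: the decision equals attribute 0; attribute 2 is constant.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
labels = [0, 0, 1, 1]
reduct = reduce_attributes(rows, labels)
print(reduct)  # → [0]
```

In this toy table only attribute 0 determines the decision, so the loop stops after a single selection.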

Description

Technical field

[0001] The invention belongs to the technical field of data processing, and in particular relates to a computer data attribute reduction method based on canonical correlation analysis.

Background

[0002] Rough set theory is an effective tool for handling imprecision and uncertainty in data mining. Attribute reduction is an important rough-set method for processing data. Its purpose is to select the most effective attributes from the original attribute set, thereby removing redundant attributes, reducing the dimensionality of the data set, and improving the performance of the learning algorithm. In the real world, data generated on the Internet often cannot be used directly for data mining. Such "dirty data" must first be denoised and simplified, that is, preprocessed. According to statistics, data preprocessing accounts for more than 60% of the overall data mining process....
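The background refers to rough-set attribute reduction without giving formulas. For orientation, the standard textbook definitions are reproduced below; the notation is an assumption and is not taken from the patent text. Here $U$ is the set of objects, $C$ the conditional attributes, $D$ the decision attributes, and $[x]_B$ the set of objects indistinguishable from $x$ on the attributes $B \subseteq C$.

```latex
% Positive region: objects whose B-equivalence class lies inside one decision class
\mathrm{POS}_B(D) = \bigcup_{X \in U/D} \{\, x \in U : [x]_B \subseteq X \,\}

% Dependency degree of D on B
\gamma_B(D) = \frac{|\mathrm{POS}_B(D)|}{|U|}

% R \subseteq C is a reduct if it preserves the dependency degree and is minimal
\gamma_R(D) = \gamma_C(D), \qquad
\gamma_{R \setminus \{a\}}(D) < \gamma_R(D) \quad \text{for all } a \in R
```

Under these definitions, a redundant attribute is one whose removal leaves $\gamma$ unchanged.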


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62
CPC: G06F18/2411
Inventors: 商琳, 李萍, 吴建阳
Owner: NANJING UNIV