Clustering method for high dimensional data based on Bayes mixed common factor analyzer

A technique involving high-dimensional data and common factors, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc.

Inactive Publication Date: 2013-07-31
INFORMATION & COMM BRANCH OF STATE GRID JIANGSU ELECTRIC POWER
Cites: 2; Cited by: 16

AI Technical Summary

Problems solved by technology

However, methods based on MFA (mixtures of factor analyzers) for processing high-dimensional data, especially when used for clustering, still have limitations.
First, because each mixture component in MFA has its own factor loading matrix, the model has a large number of parameters overall, and existing MFA relies on the maximum likelihood criterion for model inference and parameter estimation, so overfitting is likely when the number of high-dimensional samples is small. Second, and most importantly, in most data clustering applications the number of categories is not known in advance; setting it too high or too low degrades the accuracy of the final clustering result, and the problem becomes harder for high-dimensional data. How to adaptively determine the optimal number of categories from high-dimensional data while reducing dimensionality, and thereby obtain better clustering performance, is a key difficulty in high-dimensional data clustering techniques and methods.
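The parameter-count argument above can be made concrete with a rough sketch. It assumes a common-factor parameterization in which all components share one loading matrix and one diagonal noise covariance, with component-specific means and covariances in the q-dimensional factor space; this page does not spell out the parameterization, so treat that as an assumption. The function names and the choice q = 3 are hypothetical, and identifiability corrections are ignored.

```python
def mfa_param_count(g, p, q):
    """Rough free-parameter count for a mixture of factor analyzers
    with g components, p observed dimensions and q factors.
    Each component has its own mean, loading matrix and diagonal noise."""
    mixing = g - 1          # mixing proportions sum to one
    means = g * p           # one p-dimensional mean per component
    loadings = g * p * q    # one p x q loading matrix per component
    noise = g * p           # one diagonal noise covariance per component
    return mixing + means + loadings + noise


def mcfa_param_count(g, p, q):
    """Rough count for a common-factor variant (assumed parameterization):
    one shared p x q loading matrix and one shared diagonal noise covariance,
    with per-component means and covariances in the q-dimensional factor space."""
    mixing = g - 1
    loading = p * q                        # single shared loading matrix
    noise = p                              # single shared diagonal noise
    factor_means = g * q                   # q-dimensional mean per component
    factor_covs = g * q * (q + 1) // 2     # symmetric q x q covariance per component
    return mixing + loading + noise + factor_means + factor_covs


# Hypothetical setting: g = 6 components, p = 50 dimensions, q = 3 factors.
mfa_total = mfa_param_count(6, 50, 3)
mcfa_total = mcfa_param_count(6, 50, 3)
```

Under these assumptions the per-component loading matrices dominate the MFA count, which is why sharing the loading matrix reduces the total so sharply when p is large.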


Examples

Embodiment Construction

[0095] To better illustrate the high-dimensional data clustering method based on the Bayesian Mixed Common Factor Analyzer (BMCFA) of the present invention, the method is applied to clustering high-dimensional gene expression data in the field of bioinformatics. The data to be clustered are the 248 preprocessed tissue samples provided by Yeoh et al. (E. J. Yeoh et al., Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell, vol. 1, no. 2, pp. 133-143, 2002), and the dimension of each sample is 50; that is, N = 248 and p = 50.

[0096] This application has 6 classes. The class names and sample counts are: MLL (20 samples), T-ALL (43 samples), Hyperdip (64 samples), TEL-AML1 (79 samples), E2A-PBX1 (27 samples), BCR-ABL (15 samples). Assume that the number of clusters and the specific conditions are not known before clustering, and ...
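As a quick consistency check, the class sizes listed in [0096] can be summed and compared with N = 248 from [0095]. The sketch below is illustrative only; the variable names are not taken from the patent.

```python
# Class sizes as listed in paragraph [0096].
class_sizes = {
    "MLL": 20,
    "T-ALL": 43,
    "Hyperdip": 64,
    "TEL-AML1": 79,
    "E2A-PBX1": 27,
    "BCR-ABL": 15,
}

# Total number of samples; should match N from paragraph [0095].
n_samples = sum(class_sizes.values())

# Share of each class, useful for seeing how unbalanced the classes are.
proportions = {name: size / n_samples for name, size in class_sizes.items()}

# Name of the largest class.
largest = max(class_sizes, key=class_sizes.get)
```

The classes are noticeably unbalanced, ranging from 15 to 79 samples.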

Abstract

The invention discloses a clustering method for high-dimensional data based on a Bayesian mixed common factor analyzer. The method comprises the following steps: first, a Bayesian mixed common factor analyzer model is built for the high-dimensional data to be clustered; second, the posterior distributions of the model's random variables are inferred, yielding the statistics associated with those variables; finally, the category to which each high-dimensional datum belongs is determined, which completes the clustering. The Bayesian mixed common factor analyzer model is highly flexible; because inference follows the Bayesian criterion, overfitting and the curse of dimensionality are effectively avoided; and the method automatically adapts the model structure to the high-dimensional data, so that the optimal number of categories is determined automatically and clustering is completed while dimensionality is reduced, giving excellent clustering performance and computational efficiency.
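The final step of the abstract, deciding which category each datum belongs to, can be sketched as follows. This is a minimal illustration and not the patent's procedure. It assumes that inference has already produced a matrix of posterior membership probabilities (here called `responsibilities`, a hypothetical name) and that each datum is assigned to the component with the largest probability. Reading the number of categories off the components that end up with at least one datum is likewise an assumption about how surplus components would show up.

```python
def assign_clusters(responsibilities):
    """Assign each sample to the component with the highest posterior
    membership probability. responsibilities[n][k] is assumed to be the
    posterior probability that sample n belongs to component k."""
    return [max(range(len(row)), key=row.__getitem__) for row in responsibilities]


def active_components(labels):
    """Return the sorted component indices that received at least one sample."""
    return sorted(set(labels))


# Toy posterior probabilities for 3 samples and 3 candidate components
# (made-up numbers, for illustration only).
toy_responsibilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
]
toy_labels = assign_clusters(toy_responsibilities)
toy_active = active_components(toy_labels)
```

In this toy input the middle component receives no samples, so only two of the three candidate components remain in use.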

Description

Technical field

[0001] The invention relates to a clustering method based on the Bayesian mixed common factor analyzer, which is a processing method and application technology for high-dimensional data.

[0002] Background technique

[0003] With the continuous development of data collection and storage technology, high-dimensional and ultra-high-dimensional data continue to emerge. Examples include the tens of thousands of face images common in image retrieval, the high-dimensional vectors that inevitably arise in document search over hundreds of thousands of web texts, speech and audio signals in signal processing, high-dimensional expression data subjected to cluster analysis, and so on. Clearly, the higher the dimension (the more attributes an object has), the more comprehensively the object can be described and the better objects can be distinguished. However, when the number of data samples is small, excessively high dimensionality inevitably poses a severe challenge to the processing of ...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 魏昕, 李宗辰
Owner: INFORMATION & COMM BRANCH OF STATE GRID JIANGSU ELECTRIC POWER