Classification algorithms for high-dimensional data

A classification algorithm for high-dimensional data, applied in the field of data processing. It addresses problems such as unbalanced classification accuracy, poor classification results on high-dimensional data, and the inability to deal effectively with subspace selection for high-dimensional data.

Status: Inactive
Publication Date: 2017-10-20
Owner: SHANGHAI FENGBAO INFORMATION TECH CO LTD
0 Cites, 24 Cited by

AI Technical Summary

Problems solved by technology

[0002] With the rapid development of technologies such as the Internet and the Internet of Things, the data generated by many practical applications is growing in both scale and dimensionality. Because high-dimensional data is ubiquitous and important, high-dimensional data mining has become an active area of academic research, and the current focus is mainly on classification. Owing to the effects of dimensionality, many classification methods that perform well on low-dimensional data cannot obtain good results on high-dimensional data, so high-dimensional data classification poses a new challenge to traditional classification techniques. It is also of great practical significance, because it helps people automatically determine the category to which data belongs: for example, automatic classification of news texts, automatic identification of gene expression function, distinguishing users with different interests in network marketing, and filtering useful information in public opinion analysis. However, research results and technologies for high-dimensional data classification are still few, and many research problems remain open. Research on high-dimensional data classification therefore mainly considers four problems: classification of high-dimensional data with multi-noise attributes, multi-domain transfer learning, multi-class label learning, and multi-instance multi-class label learning. A series of algorithms is proposed to solve these four research problems.
[0003] Random forest is an efficient classification method, but a random forest built on purely random sampling cannot effectively handle the subspace selection problem of high-dimensional data. A new stratified sampling subspace selection method is therefore proposed, which combines a strong attribute set and a weak attribute set when sampling the random forest subspace. The experimental results show that the stratified random forest method obtains better classification performance than other random forest algorithms and provides an effective method for high-dimensional data classification. The work summarizes the research problems of classifying high-dimensional data with multi-noise attributes, analyzes the shortcomings of random-sampling subspace selection, studies the stratified sampling subspace selection method, and proposes a corresponding stratified sampling random forest algorithm for each of several high-dimensional classification problems. For high-dimensional data with multi-noise attributes, the SRF algorithm is proposed, which can effectively classify high-dimensional data such as gene and image data. For unbalanced text data, the ForesTexter algorithm is proposed, which effectively improves classification accuracy on the small categories. For whole-genome data analysis, the GWA-SRF algorithm is proposed, which avoids the computational cost of exhaustively searching for the optimal subspace size while automatically computing attribute importance and discovering important attributes related to the disease under study.
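
The idea of drawing a subspace from both a strong and a weak attribute set can be illustrated with a minimal Python sketch. The text above does not say how attributes are scored or how the two sets are separated, so everything specific here is an assumption made for illustration: the function name `stratified_subspace`, the precomputed per-attribute `scores` (for example an information-gain or chi-square value), the use of the mean score as the strong/weak boundary, and the proportional allocation of slots.

```python
import random


def stratified_subspace(scores, p, rng):
    """Pick p attribute indices, drawing from a strong and a weak stratum.

    scores: one informativeness score per attribute (assumed precomputed).
    p:      subspace size, 2 <= p <= number of attributes.
    rng:    a random.Random instance.
    """
    n = len(scores)
    if not 2 <= p <= n:
        raise ValueError("p must satisfy 2 <= p <= number of attributes")

    # Assumption: attributes scoring above the mean are "strong", the rest "weak".
    mean = sum(scores) / n
    strong = [i for i, s in enumerate(scores) if s > mean]
    weak = [i for i, s in enumerate(scores) if s <= mean]

    # If one stratum is empty (e.g. all scores equal), use plain random sampling.
    if not strong or not weak:
        return rng.sample(range(n), p)

    # Assumption: allocate slots in proportion to stratum size,
    # but always keep at least one attribute from each stratum.
    n_strong = round(p * len(strong) / n)
    n_strong = min(max(n_strong, 1), p - 1, len(strong))
    n_weak = min(p - n_strong, len(weak))
    n_strong = p - n_weak  # give leftover slots back to the strong stratum

    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)


rng = random.Random(0)
scores = [0.9, 0.8] + [0.1] * 8   # two informative attributes, eight noisy ones
subspace = stratified_subspace(scores, 4, rng)
```

With purely random sampling, a subspace of 4 out of these 10 attributes can miss both informative attributes entirely; the stratified draw always includes at least one of them, which is the motivation given above for data with many noisy attributes.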

Examples

Embodiment Construction

[0081] In order to make the technical means, creative features, goals and effects achieved by the present invention easy to understand, the present invention will be further elaborated below in conjunction with illustrations and specific embodiments.

[0082] Referring to Figure 1, the present invention provides a classification algorithm for high-dimensional data, comprising a random forest algorithm, the ForesTexter algorithm, and the GW-SRF algorithm.

[0083] Random forest is a decision tree ensemble learning model. The algorithm flow of random forest can be described as follows:

[0084] 1) First, use the Bagging method to sample the training data set X with replacement, obtaining K data subsets {X1, X2, ..., XK}.

[0085] 2) For each training data subset Xk, use the CART [91] method to construct a decision tree. For each node of the decision tree, randomly select p attributes (p≤N) from the attribute space S as an attribute subset (attribute subspac...
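
The two steps listed above can be sketched in plain Python. The source text is cut off after step 2, so the parts it does not show are filled in with the standard random forest recipe and should be read as assumptions: splits are chosen by Gini impurity within the sampled attribute subset, trees stop at a hypothetical `max_depth`, and the forest predicts by majority vote. All function names are illustrative, and class labels are assumed not to be tuples.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, p, rng, depth=0, max_depth=5):
    """CART-style tree; internal nodes are (attribute, threshold, left, right)."""
    # Leaf: pure node or depth limit reached, so predict the majority class.
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    best = None
    # Step 2: at each node, consider only p randomly chosen attributes.
    for j in rng.sample(range(len(X[0])), p):
        # Candidate thresholds: every observed value except the largest,
        # so both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # every sampled attribute is constant at this node
        return Counter(y).most_common(1)[0][0]
    _, j, t, left, right = best
    return (j, t,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       p, rng, depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       p, rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node


def random_forest(X, y, K, p, rng):
    """Train K trees, each on its own bootstrap sample of (X, y)."""
    forest = []
    for _ in range(K):
        # Step 1: Bagging, i.e. draw len(X) rows with replacement.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], p, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees (assumed; not shown in the truncated text)."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.1, 0.1],
     [1.0, 1.0], [0.9, 0.8], [0.8, 0.9], [0.9, 0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = random_forest(X, y, K=25, p=1, rng=random.Random(0))
```

Here `p=1` means each node sees a single randomly chosen attribute out of N=2. On real high-dimensional data p is much smaller than N, which is why the choice of subspace matters.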

Abstract

The invention discloses classification algorithms for high-dimensional data, comprising a random forest algorithm, the ForesTexter algorithm, and the GW-SRF algorithm. Each node in a hierarchical tree can perform learning and multi-class label prediction. Labels predicted at parent nodes are passed to and inherited by child nodes. This label transmission between nodes makes classification of multi-label data effective and allows relations among labels to be discovered automatically. Experiments on multi-label data show that the classification results of the ML-Tree algorithm are clearly superior to those of traditional algorithms. The classification algorithms provided by the invention are therefore novel in design, perform well, and have good prospects for market adoption.

Description

Technical field

[0001] The invention relates to the field of data processing, in particular to classification algorithms for high-dimensional data.

Application Information

IPC(8): G06K9/62
CPC: G06F18/2411, G06F18/24323
Inventor: 李臻
Owner: SHANGHAI FENGBAO INFORMATION TECH CO LTD