Sorting algorithms for high dimensional data

A classification algorithm technology for high-dimensional data, applied in the field of data processing. It addresses problems such as unbalanced classification accuracy, poor classification results, and the inability to handle subspace selection for high-dimensional data effectively.

Status: Inactive
Publication Date: 2017-10-20

SHANGHAI FENGBAO INFORMATION TECH CO LTD

0 Cites, 24 Cited by

Problems solved by technology

[0002] With the rapid development of technologies such as the Internet and the Internet of Things, the data generated by many practical applications is growing in both scale and dimensionality. Because high-dimensional data is ubiquitous and important, high-dimensional data mining has become an active research area in academia, with the current focus mainly on classification. Because of the effects of dimensionality, many classification methods that perform well on low-dimensional data cannot obtain good results on high-dimensional data, so high-dimensional data classification is a new challenge for traditional classification techniques. It is also of great practical significance, because it helps people automatically determine the category to which data belongs: for example, automatic classification of news texts, automatic identification of gene expression function, distinguishing users with different interests in network marketing, and filtering useful information in public opinion analysis. However, research results and technologies for high-dimensional data classification are still few, and many research problems remain to be solved. The research therefore considers four problems: high-dimensional data classification with multi-noise attributes, multi-domain transfer learning, multi-class label learning, and multi-instance multi-class label learning. A series of algorithms is proposed to solve these four research problems.

[0003] Random forest is an efficient classification method, but a random forest based on random sampling cannot effectively handle the subspace selection problem of high-dimensional data. A new stratified (hierarchical) sampling subspace selection method is therefore proposed, which combines a strong attribute set and a weak attribute set for random forest subspace sampling. Experimental results show that the stratified random forest method obtains better classification performance than other random forest algorithms and provides an effective method for high-dimensional data classification. For high-dimensional data classification with multi-noise attributes, a random forest algorithm based on stratified sampling is proposed. After summarizing the research problems of high-dimensional data classification with multi-noise attributes and analyzing the shortcomings of random sampling subspace selection, the stratified sampling subspace selection method is studied, and a corresponding stratified sampling random forest algorithm is proposed for each of several high-dimensional data classification problems. The SRF algorithm is proposed for high-dimensional data classification with multi-noise attributes and can effectively handle the classification of high-dimensional data such as genes and images. The ForesTexter algorithm is proposed for the classification of unbalanced text data and effectively improves the classification accuracy of small categories in unbalanced data. The GWA-SRF algorithm is proposed for the analysis of whole-genome data; it avoids the computational cost of an exhaustive search for the optimal subspace size, and it can automatically calculate the importance of attributes and discover important attributes related to the disease under study.
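The idea of drawing each subspace from both a strong and a weak attribute set can be illustrated with a minimal sketch. The text above does not say how the two sets are formed or how many attributes are drawn from each, so the mean-score split, the proportional allocation, and the function name `stratified_subspace` are all assumptions made for illustration only, not the patented procedure.

```python
import random


def stratified_subspace(scores, p, rng):
    """Pick p attribute indices, drawing from a strong and a weak stratum.

    scores: one informativeness score per attribute. How the scores are
    computed is left to the caller (hypothetical; not given in the source).
    Requires 1 <= p <= len(scores).
    """
    n = len(scores)
    mean = sum(scores) / n
    # Assumption: attributes scoring above the mean are "strong", the rest "weak".
    strong = [i for i, s in enumerate(scores) if s > mean]
    weak = [i for i, s in enumerate(scores) if s <= mean]
    if not strong or not weak:
        # Only one stratum exists, so fall back to plain random sampling.
        return sorted(rng.sample(range(n), p))
    # Assumption: allocate draws in proportion to stratum size,
    # while keeping at least one strong attribute in every subspace.
    n_strong = round(p * len(strong) / n)
    n_strong = max(1, min(n_strong, p - 1, len(strong)))
    n_weak = min(p - n_strong, len(weak))
    return sorted(rng.sample(strong, n_strong) + rng.sample(weak, n_weak))
```

With plain random sampling over many noisy attributes, a subspace can easily contain no informative attribute at all; sampling per stratum guarantees that each subspace contains at least one attribute from the strong set.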

Embodiment Construction

[0081] In order to make the technical means, creative features, goals and effects achieved by the present invention easy to understand, the present invention will be further elaborated below in conjunction with illustrations and specific embodiments.

[0082] Referring to Figure 1, the present invention provides classification algorithms for high-dimensional data, including a random forest algorithm, the ForesTexter algorithm, and the GW-SRF algorithm.

[0083] Random forest is a decision tree ensemble learning model. The algorithm flow of random forest can be described as follows:

[0084] 1). First, use the Bagging sampling method to sample the training data set X with replacement, obtaining K data subsets {X1, X2, ..., XK};

[0085] 2). For each training data subset Xk, use the CART [91] method to construct a decision tree. For each node of the decision tree, randomly select p attributes (p≤N) from the attribute space S as an attribute subset (attribute subspac...
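The two steps visible above (bootstrap sampling, then a CART-style tree that considers only p random attributes at each node) can be sketched in plain Python. The source text is cut off before it describes how the trees are combined, so the majority vote in `predict_forest` is an assumption based on the standard random forest; the Gini criterion, the `max_depth` limit, and all function names are likewise illustrative choices.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, p, rng, depth=0, max_depth=8):
    """Grow a CART-style tree; internal nodes are dicts, leaves are labels."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or depth >= max_depth:
        return majority
    best = None  # (weighted Gini, attribute index, threshold)
    # Step 2: only p randomly chosen attributes are candidates at this node.
    for a in rng.sample(range(len(X[0])), p):
        # Dropping the largest value keeps both sides of every split non-empty.
        for t in sorted(set(row[a] for row in X))[:-1]:
            left = [lab for row, lab in zip(X, y) if row[a] <= t]
            right = [lab for row, lab in zip(X, y) if row[a] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, a, t)
    if best is None:  # the sampled attributes are constant on these rows
        return majority
    _, a, t = best
    lo = [(row, lab) for row, lab in zip(X, y) if row[a] <= t]
    hi = [(row, lab) for row, lab in zip(X, y) if row[a] > t]
    return {
        "attr": a,
        "thr": t,
        "left": build_tree([r for r, _ in lo], [l for _, l in lo], p, rng, depth + 1, max_depth),
        "right": build_tree([r for r, _ in hi], [l for _, l in hi], p, rng, depth + 1, max_depth),
    }


def train_forest(X, y, K, p, seed=0):
    """Step 1: draw K bootstrap samples with replacement, one tree per sample."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(K):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], p, rng))
    return forest


def predict_tree(node, x):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(node, dict):
        node = node["left"] if x[node["attr"]] <= node["thr"] else node["right"]
    return node


def predict_forest(forest, x):
    """Assumption: combine the trees by majority vote."""
    return Counter(predict_tree(tree, x) for tree in forest).most_common(1)[0][0]
```

A call such as `train_forest(X, y, K=15, p=1)` returns a list of K trees, and `predict_forest(forest, x)` classifies one row `x`. Replacing the uniform `rng.sample` call inside `build_tree` with a stratified draw is where a stratified-sampling variant would differ from this baseline.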

Abstract

The invention discloses classification algorithms for high-dimensional data, including a random forest algorithm, the ForesTexter algorithm, and the GW-SRF algorithm. Each node in a hierarchy tree can perform learning and multi-class label prediction. Predicted labels learned by parent nodes can be passed to and inherited by child nodes. With this label transmission between nodes, multi-label data can be classified effectively and the relationships among labels can be discovered automatically. Experiments on multi-label data show that the classification results of the ML-Tree algorithm are significantly better than those of traditional algorithms. The algorithms provided by the invention are therefore novel in design, with good market promotion prospects and high perception performance.

Description

Technical field

[0001] The invention relates to the field of data processing, in particular to classification algorithms for high-dimensional data.

Application Information

IPC(8): G06K9/62

CPC: G06F18/2411, G06F18/24323

Inventor: 李臻

Owner: SHANGHAI FENGBAO INFORMATION TECH CO LTD
