
High-dimensional feature data classification method and system based on distributed parallel decision tree

A feature data classification technology, applied in special data processing applications, relational databases, database models, and similar fields. It addresses the inability of existing methods to process high-dimensional feature data efficiently, and achieves the effects of shortening the establishment time and improving parallel efficiency.

Active Publication Date: 2020-06-09
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0009] The purpose of the present invention is to overcome the inability of existing parallel decision tree algorithms to process high-dimensional feature data efficiently. It proposes a parallel decision tree algorithm that runs in parallel at the node level and the feature level at the same time, which effectively improves the processing efficiency for high-dimensional feature data.




Embodiment Construction

[0037] While conducting large-scale data mining research, the inventors found that the data dimension was very large and that existing decision tree algorithms could not handle such data well. A serial decision tree cannot handle large-scale data, and existing parallel decision tree algorithms have a low degree of parallelism: even the fastest algorithm is parallel only at the node level, not in the optimal feature selection step. When the feature dimension is large and each feature has many values, a multi-way decision tree produces too many nodes, which causes excessive memory usage and overfitting. A binary decision tree, in turn, must traverse all possible splits, compute the information gain of each split, and then decide the optimal one, which is also very time-consuming. Existing parallel decision tree algorithms do not take this into account, because naturally occurring data rarely have particularly large feature dime...
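The cost described above can be illustrated with a minimal sketch. It assumes a single categorical feature and a "value versus all other values" form of binary split; the excerpt does not say how the patent enumerates candidate splits, so the function names `entropy` and `best_binary_split` and the split form are hypothetical choices for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Try every 'feature == v' vs 'feature != v' split (an assumed split form).

    Returns (value, information_gain) for the split with the largest gain,
    or (None, 0.0) if no split improves on the unsplit node.
    """
    base = entropy(labels)
    best_value, best_gain = None, 0.0
    for v in sorted(set(values)):  # one full pass over the data per candidate value
        left = [y for x, y in zip(values, labels) if x == v]
        right = [y for x, y in zip(values, labels) if x != v]
        if not left or not right:
            continue
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best_gain:
            best_value, best_gain = v, gain
    return best_value, best_gain


values = ["a", "a", "b", "c"]
labels = [1, 1, 0, 0]
split_value, split_gain = best_binary_split(values, labels)
```

Each candidate value needs its own pass over the samples, so the work grows with both the number of features and the number of distinct values per feature. That growth is the time cost the paragraph refers to.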


Abstract

The invention provides a high-dimensional feature data classification method and system based on a distributed parallel decision tree. It implements a Spark-based parallel decision tree algorithm oriented to high-dimensional feature data. The algorithm has a high degree of parallelism and can process large-scale data sets: it performs parallel computation not only across nodes on the same layer of the decision tree but also at the feature level, which increases the degree of parallelism for high-dimensional data and can effectively reduce the processing time for high-dimensional features.
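The idea of parallelism at both the node level and the feature level can be sketched as follows. This is not the patented Spark implementation: it uses Python's standard `concurrent.futures.ThreadPoolExecutor` as a local stand-in for a cluster, and the task layout, the one-versus-rest split form, and the names `gain_for_feature` and `best_features` are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_for_feature(task):
    """Score one (node, feature) pair: best one-vs-rest split gain for that feature."""
    node_id, feature_idx, rows, labels = task
    base = entropy(labels)
    column = [row[feature_idx] for row in rows]
    best = 0.0
    for v in set(column):
        left = [y for x, y in zip(column, labels) if x == v]
        right = [y for x, y in zip(column, labels) if x != v]
        if left and right:
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            best = max(best, base - weighted)
    return node_id, feature_idx, best


def best_features(nodes, n_features, workers=4):
    """nodes maps node_id -> (rows, labels) for all nodes on one tree layer.

    Every (node, feature) pair becomes an independent task, so the work is
    spread across nodes and across features at the same time.
    Returns node_id -> (feature_idx, gain).
    """
    tasks = [
        (node_id, f, rows, labels)
        for node_id, (rows, labels) in nodes.items()
        for f in range(n_features)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(gain_for_feature, tasks))
    best = {}
    for node_id, f, gain in results:  # reduce step: keep the best feature per node
        if node_id not in best or gain > best[node_id][1]:
            best[node_id] = (f, gain)
    return best


rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
layer = {0: (rows, [1, 1, 0, 0]), 1: (rows, [1, 0, 1, 0])}
chosen = best_features(layer, n_features=2)
```

In a distributed setting, the list of `(node, feature)` tasks would be partitioned across workers and the final loop would become a reduce by node; the thread pool here only shows the shape of the computation and does not give real speedup for CPU-bound Python code.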

Description

Technical field

[0001] The invention relates to the field of tree classification, and in particular to a method and system for classifying high-dimensional feature data based on distributed parallel decision trees.

Background technique

[0002] The decision tree classification algorithm is an instance-based inductive learning method that extracts a tree-shaped classification model from a given set of unordered training samples. Each non-leaf node in the tree records which feature is used to judge the category, and each leaf node represents the final category. The path from the root node to each leaf node forms a classification rule. To test a new sample, start from the root node, test at each branch node, and recursively enter the subtree along the corresponding branch until a leaf node is reached. The category represented by that leaf node is the predicted category of the current test sample. Quinlan proposed the famous ID3 algorithm i...
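The prediction procedure described in paragraph [0002] can be sketched in a few lines. The nested-dictionary tree layout, the key names `feature`, `children`, and `label`, and the example features are hypothetical; the sketch also assumes every feature value seen at prediction time has a matching branch.

```python
def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's category.

    A non-leaf node is {"feature": name, "children": {value: subtree}};
    a leaf node is {"label": category}. This layout is an assumption.
    """
    if "label" in node:  # leaf node: its category is the prediction
        return node["label"]
    value = sample[node["feature"]]  # test the feature recorded at this node
    return predict(node["children"][value], sample)  # recurse into the matching branch


tree = {
    "feature": "outlook",
    "children": {
        "sunny": {
            "feature": "windy",
            "children": {"yes": {"label": "stay"}, "no": {"label": "play"}},
        },
        "rainy": {"label": "stay"},
    },
}

prediction = predict(tree, {"outlook": "sunny", "windy": "no"})
```

Each root-to-leaf path in `tree` corresponds to one classification rule, for example "outlook is sunny and windy is no, so play".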


Application Information

IPC(8): G06K9/62, G06F16/27, G06F16/28, G06F16/2458
CPC: G06F16/27, G06F16/285, G06F16/2462, G06F18/24323, Y02D10/00
Inventor 孙莹, 庄福振, 敖翔, 何清
Owner INST OF COMPUTING TECH CHINESE ACAD OF SCI