Parallel feature selection method based on MapReduce

A feature selection technique, applicable to knowledge-based computer systems, special-purpose data processing applications, instruments, and similar fields, that addresses the inability of conventional methods to handle large-scale datasets.

Inactive Publication Date: 2014-02-05
SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN

AI Technical Summary

Problems solved by technology

Conventional feature selection methods cannot handle large-scale datasets


Embodiment Construction

[0019] The present invention is further described below in conjunction with an embodiment.

[0020] Parallel feature selection is the natural choice for processing large-scale data. Parallel algorithms build on different parallel processing technologies, such as multithreading, MPI, MapReduce, and workflow systems, each with its own performance characteristics and scope of application. MPI suits compute-intensive problems, especially simulation, but its demanding runtime requirements and complex programming make it hard to use in practice. MapReduce is a distributed data processing model proposed in the field of information retrieval, and Hadoop is currently the most widely used open-source MapReduce implementation. However, the MapReduce model under the Hadoop architecture does not support iterative Map and Reduce tasks, which many data mining algorithms require. Pro...


Abstract

The invention discloses a parallel feature selection method based on MapReduce. The method comprises:
  • a. dividing the data into parts;
  • b. distributing the data parts to the Map computing nodes;
  • c. establishing a set of selected features and a set of unselected features;
  • d. computing the combined (joint) mutual information between the feature vectors and the class variable;
  • e. taking the feature variable with the largest combined mutual information as the feature vector, and transmitting its serial number and the corresponding combined mutual information value to a Reduce computing node;
  • f. judging whether a feature vector equal to the count exists;
  • g. comparing the combined mutual information values;
  • h. adding the chosen feature to the selected set and deleting it from the unselected set;
  • i. judging the number of selected features.

With this method, the feature combination carrying the most information can be extracted quickly and effectively from massive datasets, which makes the method suitable for feature extraction on massive data.
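The steps a to i above can be sketched in plain Python as a single-process simulation. This is a minimal illustration under stated assumptions, not the patented implementation: "combined mutual information" is read as the joint mutual information between the selected features plus one candidate and the class variable; step f is read as "a feature proposed by every Map partition is accepted directly", with the largest emitted value used otherwise; features are assumed discrete; and the function names (`joint_mutual_information`, `map_task`, `reduce_task`, `select_features`) are hypothetical.

```python
import math
from collections import Counter


def joint_mutual_information(rows, labels, features):
    """Estimate I(X_features; C) in nats from discrete value counts."""
    n = len(rows)
    joint, x_counts, c_counts = Counter(), Counter(), Counter()
    for row, c in zip(rows, labels):
        x = tuple(row[f] for f in features)
        joint[(x, c)] += 1
        x_counts[x] += 1
        c_counts[c] += 1
    return sum(
        (n_xc / n) * math.log(n_xc * n / (x_counts[x] * c_counts[c]))
        for (x, c), n_xc in joint.items()
    )


def map_task(rows, labels, selected, unselected):
    """Steps d-e: score each candidate on one partition, emit the best one."""
    scored = [
        (joint_mutual_information(rows, labels, selected + [f]), f)
        for f in sorted(unselected)
    ]
    best_mi, best_f = max(scored, key=lambda s: s[0])  # ties: lowest index
    return best_f, best_mi


def reduce_task(emitted, n_parts):
    """Steps f-g (assumed reading): unanimous vote wins, else largest value."""
    votes = Counter(f for f, _ in emitted)
    for f, count in votes.items():
        if count == n_parts:
            return f
    return max(emitted, key=lambda fv: fv[1])[0]


def select_features(rows, labels, k, n_parts):
    """Steps a-c and h-i: partition the data, then select k features greedily."""
    # Steps a-b: split the data; each slice stands in for one Map node.
    parts = [(rows[i::n_parts], labels[i::n_parts]) for i in range(n_parts)]
    # Step c: selected and unselected feature sets.
    selected, unselected = [], list(range(len(rows[0])))
    # Step i: stop once k features have been selected.
    while len(selected) < k and unselected:
        emitted = [map_task(r, l, selected, unselected) for r, l in parts]
        best = reduce_task(emitted, n_parts)
        # Step h: move the chosen feature between the two sets.
        selected.append(best)
        unselected.remove(best)
    return selected
```

Each pass of the `while` loop corresponds to one Map and Reduce round, which is why the method needs a runtime that can iterate over Map and Reduce tasks. A real deployment would replace the list comprehension over `parts` with tasks distributed across computing nodes.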

Description

Technical field

[0001] The present invention relates to a parallel feature selection method based on MapReduce, and more specifically to one that can perform fast feature extraction on massive data.

Background art

[0002] In recent years, datasets in many fields, such as genome engineering, text classification, image retrieval, and customer relationship management, have grown larger and larger in both the number of samples and the number of feature variables. This can pose serious problems for the scalability and learning performance of many machine learning methods. How to select the most informative combination of feature variables is a crucial issue. Feature selection is the process of selecting a feature subset from the original feature set according to certain evaluation criteria, thereby effectively reducing the dimensionality of the feature variable set. Feature selection can effectively reduce the dimensionality o...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35, G06N5/00
Inventor: 孙占全, 顾卫东, 李钊, 赵彦玲
Owner: SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN