Distributed column subset selection method and system and leukemia gene information mining method

A distributed column subset selection technique, applied in the field of big data processing, that addresses the time-consuming computation, failure to achieve linear speedup, and poor reliability of existing algorithms.

Pending Publication Date: 2021-07-06
HUNAN UNIV +1

AI Technical Summary

Problems solved by technology

[0009] 3) The two-stage algorithm cannot achieve linear speedup in theory, so in practice it is very time-consuming to run and of limited practical use.
[0010] Furthermore, the two-stage algorithm assumes that all subsets have the same quality, which is the main cause of the above shortcomings. In fact, quality often differs between subsets, and ignoring this difference wastes time and resources and can even affect the final feature selection result. Therefore, current distributed feature selection methods based on column subset selection still suffer from low accuracy, slow computation, and poor reliability.



Examples


Embodiment Construction

[0045] The two-stage algorithm assumes that all subsets have the same quality, which is the main cause of the shortcomings of existing techniques. In fact, quality often differs between subsets, and ignoring this difference wastes time and resources and can even affect the final feature selection result. In this application, the quality of a subset is measured by the number of optimal features it contains. Specifically, a data set is assumed to contain k most representative features, called the k optimal features; the more optimal features a subset contains, the higher its quality. For the CSS (column subset selection) problem, the optimal solution S_OPT is defined as the set containing the k optimal features, that is, the combination of k features that fits the original data set best among all combinations of k features.
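The definitions in [0045] can be made concrete with a small sketch. It assumes that "ability to fit the original data set" means the usual CSS objective, the Frobenius norm of the residual after projecting the data onto the selected columns; the excerpt does not state the objective, so this is an assumption. The function names `css_residual`, `brute_force_s_opt`, and `subset_quality` are hypothetical, and the exhaustive search is only feasible for tiny inputs.

```python
from itertools import combinations

import numpy as np


def css_residual(A, cols):
    """Frobenius norm of A minus its least-squares fit by the chosen columns.

    ASSUMPTION: this standard CSS objective stands in for the patent's
    "ability to fit the original data set".
    """
    C = A[:, list(cols)]
    # Fit every column of A as a linear combination of the selected columns.
    coef, *_ = np.linalg.lstsq(C, A, rcond=None)
    return float(np.linalg.norm(A - C @ coef, "fro"))


def brute_force_s_opt(A, k):
    """Search all k-column combinations for the smallest residual (tiny inputs only)."""
    candidates = combinations(range(A.shape[1]), k)
    return min(candidates, key=lambda s: css_residual(A, s))


def subset_quality(subset, s_opt):
    """Quality as described in [0045]: the number of optimal features in the subset."""
    return len(set(subset) & set(s_opt))
```

In practice S_OPT is unknown, which is why a distributed method needs some way to estimate subset quality rather than compute it exactly.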

[0046] Figure 1 is a schematic flow chart of the method of the present invention: the distributed column su...


Abstract

The invention discloses a distributed column subset selection method. The method comprises the following steps: acquiring and processing all features in a data set and distributing them evenly across computing nodes; executing a subset quality evaluation method on each computing node to obtain the target number of features for the corresponding feature subset; having each computing node perform its own feature selection calculation to obtain the features it selects; and aggregating the feature selection results of the computing nodes to obtain the finally selected features. The invention further discloses a system based on the distributed column subset selection method, and a leukemia gene information mining method based on the method and the system. The method effectively prevents redundant features from being selected within a subset and accelerates the feature selection process. Because the selected features are aggregated directly as the final result, the method can in theory achieve at least linear speedup. It offers high accuracy, fast computation, and better reliability, and it also yields the gene features associated with leukemia and their relevance to the disease.
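The four steps in the abstract can be sketched as a single-process simulation. The excerpt does not disclose the subset quality evaluation method or the per-node selection algorithm, so both are stand-ins here: `proxy_quality` uses the stable rank of each block as a hypothetical quality score, and `greedy_select` uses greedy residual-norm pivoting. All function names and parameters are illustrative assumptions, not the patented procedure.

```python
import numpy as np


def split_uniform(n_features, n_nodes, seed=0):
    """Step 1: shuffle feature indices and deal them out evenly to the nodes."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_features), n_nodes)


def proxy_quality(block):
    """HYPOTHETICAL stand-in for subset quality evaluation: stable rank of the block."""
    s = np.linalg.svd(block, compute_uv=False)
    if s.size == 0 or s[0] == 0:
        return 0.0
    return float((s ** 2).sum() / s[0] ** 2)


def allocate_targets(scores, k):
    """Step 2: turn quality scores into integer targets summing to k (largest remainder)."""
    scores = np.asarray(scores, dtype=float)
    if scores.sum() == 0:
        scores = np.ones_like(scores)  # no information: fall back to equal shares
    shares = scores / scores.sum() * k
    targets = np.floor(shares).astype(int)
    leftover = max(0, int(k - targets.sum()))
    # Hand the remaining slots to the nodes with the largest fractional parts.
    order = np.argsort(-(shares - targets))
    targets[order[:leftover]] += 1
    return targets


def greedy_select(block, t):
    """Step 3 (stand-in): pick up to t columns by largest residual norm, then deflate."""
    R = block.astype(float)
    chosen = []
    for _ in range(min(t, R.shape[1])):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))
        if norms[j] < 1e-12:
            break  # remaining columns are redundant given those already chosen
        q = R[:, j] / norms[j]
        R = R - np.outer(q, q @ R)  # remove the chosen direction from every column
        chosen.append(j)
    return chosen


def distributed_css(A, k, n_nodes, seed=0):
    """Simulate the pipeline: group, score, select per node, concatenate."""
    groups = split_uniform(A.shape[1], n_nodes, seed)
    targets = allocate_targets([proxy_quality(A[:, g]) for g in groups], k)
    selected = []
    for g, t in zip(groups, targets):  # each iteration would run on its own node
        local = greedy_select(A[:, g], int(t))
        selected.extend(int(g[j]) for j in local)
    # Step 4: the per-node picks are concatenated directly, with no second stage.
    return sorted(selected)
```

Because the per-node results are merged without a second selection pass, the loop body is the only costly part and is independent across nodes, which is consistent with the abstract's claim about linear speedup. The sketch may return fewer than k features when a node's block is too small or too redundant to fill its target.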

Description

Technical field

[0001] The invention belongs to the field of big data processing, and in particular relates to a distributed column subset selection method and system and a leukemia gene information mining method.

Background

[0002] With the emergence of computer applications such as the Internet of Things, machine learning, computer vision, and natural language processing, people often encounter high-dimensional data with massive numbers of samples and features. Processing such high-dimensional data requires more computing and storage resources than a single machine can often provide, and most of the features in the data may be useless or redundant. Selecting representative features from high-dimensional data to serve these applications has therefore become an urgent problem. As a method that can effectively select representative features from the original feature set, feature selection technology has become th...


Application Information

IPC(8): G16B35/20, G16B40/00, G16H50/70
CPC: G16H50/70, G16B35/20, G16B40/00
Inventor: 肖正, 魏鹏程
Owner HUNAN UNIV