Parallel selection method of interactive features for large-scale high-dimensional sequence data

A technique for large-scale sequence data, applied in computational models, biological models, character and pattern recognition, etc. It addresses the problems of unstable performance and of final results that depend on the data structure, and it achieves easy visualization, convenient parallel processing, and efficient feature selection.

Active Publication Date: 2021-10-29
NORTHEASTERN UNIV LIAONING
6 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

The existing approach has the advantage that it can be applied to large-scale data sets and is not affected by marginal effects, but its performance is unstable, and the final result depends on the initial value and on the data structure of the entire search space.

Examples

Embodiment Construction

[0056] Embodiments of the present invention will be further described below in conjunction with the accompanying drawings.

[0057] In this embodiment, SNP (single nucleotide polymorphism) data from biological data is taken as the practical application background. This choice is made because verifying interactive feature subsets is currently very complicated, while the biological field already has confirmed sets of disease-causing SNP sites (features), which makes the final results convenient to verify.

[0058] The parallel selection method of interactive features for large-scale high-dimensional sequence data, as shown in Figure 1, includes:

[0059] Step 1. Encode the original sequence data to obtain the data set D.

[0060] The original high-dimensional sequence data in this embodiment is the original SNP data in biological information. Raw SNP data generally comes in two forms: genotype or haplotype data. Taking a certain locus (featu...
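
The encoding scheme itself is cut off in the text above, so the following is only a minimal sketch of what an encoding step of this kind could look like. It assumes genotype-form input and uses a common additive coding (the number of minor-allele copies per site), which is an assumption and not necessarily the encoding used in the patent. The names `encode_genotypes` and `encode_dataset` are hypothetical.

```python
from collections import Counter


def encode_genotypes(column):
    """Encode one SNP column of genotype strings (e.g. 'AA', 'AG', 'GG')
    as the number of minor-allele copies: 0, 1, or 2.

    Assumption: additive coding; the patent's own scheme is not visible here.
    """
    allele_counts = Counter(allele for genotype in column for allele in genotype)
    if len(allele_counts) < 2:
        return [0] * len(column)  # monomorphic site: no minor allele present
    # Minor allele = least frequent allele; ties are broken alphabetically.
    minor = min(allele_counts, key=lambda a: (allele_counts[a], a))
    return [genotype.count(minor) for genotype in column]


def encode_dataset(rows):
    """Encode a samples-by-sites table of genotype strings column by column."""
    columns = list(zip(*rows))
    encoded_columns = [encode_genotypes(col) for col in columns]
    return [list(sample) for sample in zip(*encoded_columns)]


# Toy input: 4 samples, 2 SNP sites (illustrative values only).
raw = [
    ["AA", "CT"],
    ["AG", "CC"],
    ["AA", "TT"],
    ["GG", "CC"],
]
D = encode_dataset(raw)
print(D)
```

Each row of `D` is one sample and each column one SNP site, so later filtering steps can treat every site as an integer-valued feature.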

Abstract

The invention provides a parallel selection method of interactive features for large-scale high-dimensional sequence data, including: encoding the original high-dimensional SNP data; retaining the encoded SNP data related to the target class through block filtering based on graph theory; performing fine-grained feature filtering on the SNP data related to the target class; dividing the feature-filtered sequence data set into several blocks with γ as the granularity, and obtaining feature candidate regions based on the maximum allele common subsequence (MACS); performing MapReduce-based diversity selection of feature regions on the data set corresponding to the candidate regions to obtain representative feature regions; and, for the representative feature regions, using a parallel ant colony algorithm with permutation search for interactive feature selection to obtain a subset of significant features, that is, a set of significant SNP sites. The present invention proposes a brand-new framework for interactive feature selection in large-scale sequence data, making feature selection more efficient and more powerful.
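
As a rough illustration of the block-partition idea in the abstract, the sketch below splits the feature indices into blocks of size γ and scores each block independently, which is the property that makes a map-style parallel step possible. The MACS-based candidate selection and the ant colony search are not described in the visible text, so `block_score` is a hypothetical placeholder (mean absolute difference between case and control feature means), and the built-in `map` stands in for a real MapReduce job.

```python
def split_into_blocks(n_features, gamma):
    """Partition feature indices 0..n_features-1 into consecutive blocks of size gamma."""
    return [list(range(start, min(start + gamma, n_features)))
            for start in range(0, n_features, gamma)]


def block_score(block, data, labels):
    """Hypothetical placeholder score, not the patent's MACS criterion:
    mean absolute difference between case and control feature means."""
    cases = [row for row, y in zip(data, labels) if y == 1]
    controls = [row for row, y in zip(data, labels) if y == 0]
    diffs = []
    for j in block:
        case_mean = sum(r[j] for r in cases) / len(cases)
        control_mean = sum(r[j] for r in controls) / len(controls)
        diffs.append(abs(case_mean - control_mean))
    return sum(diffs) / len(diffs)


def select_blocks(data, labels, gamma, top_k):
    """Score every block independently (map), then keep the top_k blocks (reduce)."""
    blocks = split_into_blocks(len(data[0]), gamma)
    scored = list(map(lambda b: (block_score(b, data, labels), b), blocks))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for _, block in scored[:top_k]]


# Toy encoded data: 4 samples, 5 features; labels 1 = case, 0 = control.
data = [
    [0, 0, 2, 2, 1],
    [0, 1, 2, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
labels = [1, 1, 0, 0]
blocks = split_into_blocks(5, 2)
chosen = select_blocks(data, labels, gamma=2, top_k=1)
print(blocks, chosen)
```

Because each block is scored without reference to the others, the scoring step could be distributed across workers; only the final ranking needs the combined results.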

Description

Technical field

[0001] The invention belongs to the technical field of interactive feature selection, in particular to an interactive feature parallel selection method for large-scale high-dimensional sequence data.

Background art

[0002] With the continuous advancement and development of data mining and machine learning technology, feature selection technology has received more and more attention. In terms of learning efficiency and learning results, machine learning models have significantly benefited from using only relevant data. The most widely used technique for finding relevant data is feature selection, which is to select a subset of features from the original feature set. The successful application of feature selection also brings new challenges, one of which is to find out the subset of potentially interacting features, because the combination of these features is the subset of features that really affect the target variable (class label). Therefore, the re...
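
To make the notion of interacting features concrete, here is a small, self-contained illustration that is not taken from the patent: a toy data set in which the class label is the XOR of two binary features. Each feature on its own carries no information about the label (no marginal effect), yet the pair determines it completely. All names in the snippet are illustrative.

```python
from itertools import product

# Every combination of two binary features, labeled with their XOR.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]


def label_rate_given(feature_index, value):
    """Fraction of positive labels among samples where one feature has the given value."""
    matching = [y for *x, y in samples if x[feature_index] == value]
    return sum(matching) / len(matching)


# Marginal view: the positive-label rate is the same for either value of a feature.
marginal_a = [label_rate_given(0, v) for v in (0, 1)]
marginal_b = [label_rate_given(1, v) for v in (0, 1)]

# Joint view: each (a, b) pair maps to exactly one label.
labels_by_pair = {}
for a, b, y in samples:
    labels_by_pair.setdefault((a, b), set()).add(y)
pair_determines_label = all(len(ys) == 1 for ys in labels_by_pair.values())

print(marginal_a, marginal_b, pair_determines_label)
```

A selection method that ranks features one at a time would discard both features here, which is why interaction-aware selection has to evaluate feature combinations.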

Application Information

Patent Type & Authority: Patents (China)
IPC (8): G16B20/20, G16B40/00, G06N3/00, G06K9/46
CPC: G06N3/006, G06V10/462
Inventors: 赵宇海, 印莹, 郭文鹏, 王国仁, 祁宏伟
Owner: NORTHEASTERN UNIV LIAONING