Outlier data mining method based on feature weighting and MapReduce

A feature-weighted outlier data mining technique, applied in data mining, special data processing applications, and electrical digital data processing. It addresses blurred cluster structure, heavy computation, and large data volumes, and aims for efficient, high-precision mining with little dependence on human factors.

Active Publication Date: 2020-09-01
太原太工天宇教育科技有限公司
Cites: 11; Cited by: 1
AI Technical Summary

Problems solved by technology

In high-dimensional mass data, due to the large amount of data and high dimensionality, the effect and efficiency of outlier data mining are seriously affected, and some outlier data hidden in the subspace and some outlier data with edge distribution may not be found.
Because of the clustering characteristics of high-dimensional sparse data sets, outlier data often lies in a particular subspace rather than in the entire feature space, and irrelevant features blur the cluster structure of the data. If the cluster structure in the data set is not well discovered, outliers become more difficult to detect, and outlier data mining cannot be carried out.
[0003] In addition, although traditional outlier data mining algorithms have been improved considerably in their respective fields in recent years, they are no longer applicable to high-dimensional data sets: the computational load is large, and mining efficiency and accuracy are low. How to mine outlier data accurately from big, high-dimensional data is therefore a major problem in outlier data mining.

Examples

Embodiment Construction

[0028] For the mining of high-dimensional and massive data, the scheme of the present invention provides the following method steps:

[0029] Step 1: Based on the feature-weighted subspace, separate the subspace data into cluster centers, clusters, and a candidate outlier data set under the MapReduce programming model.

Step 2: Calculate the global distance for the candidate outlier data set obtained in Step 1, and then define the outlier data.
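
The separation in Step 1 can be pictured with a small single-process sketch. It assumes the density peak heuristic named in the preferred embodiment, in its commonly published form: each point gets a local density `rho` (neighbours within a cutoff) and a distance `delta` to the nearest denser point. The function names, the `cutoff`, `rho_min`, and `delta_min` thresholds, and the tie-breaking rule are illustrative assumptions, not taken from the patent, and the points are assumed to be already projected into the weighted subspace. A MapReduce job would distribute the pairwise distance work instead of building the full matrix in memory.

```python
import math


def density_peaks(points, cutoff):
    """Return (rho, delta, parent, order) for a list of coordinate tuples."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: how many other points lie closer than the cutoff distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index (assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [None] * n
    for pos, i in enumerate(order):
        if pos == 0:
            # The densest point has no denser neighbour: use its farthest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:pos], key=lambda m: dist[i][m])
        delta[i], parent[i] = dist[i][j], j
    return rho, delta, parent, order


def separate(points, cutoff, rho_min, delta_min):
    """Split point indices into centers, clusters, and candidate outliers.

    The two thresholds are hypothetical tuning knobs for this sketch.
    """
    rho, delta, parent, order = density_peaks(points, cutoff)
    # Dense and far from anything denser: cluster center.
    centers = [i for i in order if rho[i] >= rho_min and delta[i] >= delta_min]
    # Sparse and far from anything denser: candidate outlier.
    candidates = [i for i in order if rho[i] < rho_min and delta[i] >= delta_min]
    label = {}
    for i in order:
        if i in centers:
            label[i] = i
        elif i not in candidates:
            # Inherit the cluster of the nearest denser point (None if it has none).
            label[i] = label.get(parent[i])
    clusters = {c: [] for c in centers}
    for i, c in label.items():
        if c is not None:
            clusters[c].append(i)
    return centers, clusters, candidates
```

Calling `separate(points, cutoff, rho_min, delta_min)` on two tight groups plus one isolated point should return one center per group, the group members under each center, and the isolated point as the only candidate. Only the candidates are passed on to Step 2.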

[0030] Preferably, in Step 1, the feature-weighted subspace is obtained by defining a feature-weighted estimation entropy on the attribute dimensions; then, under the MapReduce programming model, the subspace data set is quickly separated using the density peak algorithm. In Step 2, calculating the global distance includes calculating the global Weight_k distance of each candidate, sorting the set of Weight_k distances in descending order, and outputting the TOP-N data. Further, in the f...
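
Step 2 can be sketched as a map phase that scores chunks of candidates and a reduce phase that merges the partial rankings. The excerpt does not give the formula for the Weight_k distance, so this sketch assumes one plausible reading: the sum of feature-weighted Euclidean distances from a candidate to its k nearest neighbours in the whole data set. All function names, `chunk_size`, and that definition are assumptions for illustration; the map and reduce functions here are plain Python stand-ins for real MapReduce tasks.

```python
import heapq
import math


def weighted_distance(p, q, weights):
    """Euclidean distance with one non-negative weight per attribute."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))


def weight_k_distance(idx, data, weights, k):
    """Assumed definition: sum of weighted distances to the k nearest neighbours."""
    dists = [weighted_distance(data[idx], q, weights)
             for j, q in enumerate(data) if j != idx]
    return sum(heapq.nsmallest(k, dists))


def map_phase(chunk, data, weights, k, n):
    """Mapper stand-in: score one chunk of candidates, keep its local top n."""
    scored = [(weight_k_distance(i, data, weights, k), i) for i in chunk]
    return heapq.nlargest(n, scored)


def reduce_phase(partials, n):
    """Reducer stand-in: merge local rankings into a global one, descending."""
    return heapq.nlargest(n, (item for part in partials for item in part))


def top_n_outliers(candidates, data, weights, k, n, chunk_size=2):
    """Return the indices of the n candidates with the largest Weight_k distance."""
    chunks = [candidates[i:i + chunk_size]
              for i in range(0, len(candidates), chunk_size)]
    partials = [map_phase(c, data, weights, k, n) for c in chunks]
    return [i for _, i in reduce_phase(partials, n)]
```

Keeping only a local top n in each mapper bounds the data sent to the reducer, and `heapq.nlargest` already returns its result in descending order, which matches the sort-then-output-TOP-N step described above.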

Abstract

The invention relates to the technical field of data mining, in particular to an outlier data mining method based on feature weighting and MapReduce. The method comprises the following steps: Step 1, based on a feature-weighted subspace, separating the subspace data into cluster centers, clusters, and a candidate outlier data set under the MapReduce programming model; Step 2, calculating a global distance for the candidate outlier data set from Step 1, and then defining the outlier data. According to the invention, the computational load of the method is reasonable, the influence of human factors is small, and mining efficiency and precision are high. For high-dimensional mass data, feature dimensions that provide no valuable information are automatically found and deleted, which effectively reduces the effect of the curse of dimensionality. The invention provides a technical scheme for mining outliers in high-dimensional massive data that is simple in structure, relatively accurate, and performs well, so that the efficiency problem in outlier detection is largely overcome; the method is expected to have broad application in the field of big data.
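
The automatic deletion of uninformative dimensions can be illustrated with an entropy-based weight per attribute. The excerpt does not define the patent's feature-weighted estimation entropy, so the sketch below substitutes a simple stand-in: a histogram Shannon entropy per attribute, where an attribute whose values spread evenly over all bins gets weight near zero and is dropped. The names, the `bins` and `min_weight` parameters, and the treatment of constant attributes are all assumptions for illustration.

```python
import math
from collections import Counter


def attribute_entropy(values, bins):
    """Shannon entropy (natural log) of an equal-width histogram of one attribute."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: treated as uninformative in this sketch (assumption).
        return math.log(bins)
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def feature_weights(data, bins=4):
    """Normalised weights: low-entropy (concentrated) attributes weigh more."""
    dims = len(data[0])
    raw = []
    for d in range(dims):
        h = attribute_entropy([row[d] for row in data], bins)
        # Clamp at zero to absorb floating-point noise near maximum entropy.
        raw.append(max(0.0, 1.0 - h / math.log(bins)))
    total = sum(raw)
    if total == 0:
        return [1.0 / dims] * dims
    return [r / total for r in raw]


def select_subspace(data, bins=4, min_weight=0.1):
    """Return (kept attribute indices, their weights), dropping weak attributes."""
    weights = feature_weights(data, bins)
    keep = [d for d, w in enumerate(weights) if w >= min_weight]
    return keep, [weights[d] for d in keep]
```

The kept indices define the subspace in which distances are computed, and the kept weights can be used as per-attribute weights in a weighted distance.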

Description

Technical field

[0001] The present invention relates to the technical field of data mining, in particular to a method for outlier data mining.

Background

[0002] Outlier data is data that clearly deviates from other data, does not fit the general pattern or behavior of the data, and is inconsistent with the rest of the data set. It often contains valuable information that is not easy for people to discover. As an important branch of data mining, outlier data mining has been widely used in securities markets, astronomical spectrum data analysis, network intrusion detection, financial fraud detection, extreme weather analysis, and other fields. In high-dimensional mass data, due to the large amount of data and high dimensionality, the effect and efficiency of outlier data mining are seriously affected, and some outlier data hidden in the subspace and some outlier data with edge distribution may not be found. It is precisely because of the clustering characteristics of high...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/2458, G06K9/62, G06F16/215
CPC: G06F16/2465, G06F16/215, G06F2216/03, G06F18/2321, G06F18/22
Inventor: 朱晓军, 吕士钦, 娄圣金
Owner: 太原太工天宇教育科技有限公司