Large-scale data abnormity recognition method based on bidirectional sampling combination

A large-scale data anomaly identification technique, applied in the field of anomaly identification, that addresses the curse of dimensionality, large sample sizes, and high time complexity. It reduces the impact of noise, overcomes the curse of dimensionality, and reduces the scale of the data.

Inactive Publication Date: 2015-03-25
北京系统工程研究所

AI Technical Summary

Problems solved by technology

[0026] To overcome the shortcomings of the prior art described above, the present invention provides a large-scale data anomaly identification method based on bidirectional sampling combination. Bidirectional sampling addresses both the large sample size, with its high time complexity, and the curse of dimensionality. Because the data set is partitioned by sampling, the scalability of the method is also improved.



Examples


Embodiment 1

[0077] The effect of the method of the present invention is illustrated below using simulated data sets generated from multivariate Gaussian distributions:

[0078] First, the simulation data sets are generated from multivariate Gaussian distributions. The number of sample points n in each data set is 1000, 2000, 5000, 10000, 50000, or 100000, and the sample dimension m is 20, 100, 200, 500, 1000, or 2000, for a total of 42 simulation data sets. Each sample data set D consists of c clusters, where the number of clusters c ranges from 5 to 10. Assume that, in each simulation data set, the sample points $D_c$ of each cluster all follow an m-variate Gaussian distribution, namely $D_c \sim N(\vec{\mu}_c, \Sigma_c)$ ...
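The data generation described in [0078] can be sketched roughly as follows. This is a minimal illustration, not the patent's procedure: the source text is truncated before it says how the cluster means and covariances are chosen, so the uniform ranges for the means and the diagonal covariance used here are assumptions, and the function name `simulate_gaussian_clusters` is hypothetical.

```python
import random


def simulate_gaussian_clusters(n, m, n_clusters, seed=0):
    """Draw n points of dimension m from a mixture of Gaussian clusters.

    Assumptions (not from the source): cluster means are uniform in
    [-10, 10], each covariance is diagonal with per-attribute standard
    deviations uniform in [0.5, 2.0], and clusters are equally likely.
    """
    rng = random.Random(seed)
    means = [[rng.uniform(-10.0, 10.0) for _ in range(m)]
             for _ in range(n_clusters)]
    stdevs = [[rng.uniform(0.5, 2.0) for _ in range(m)]
              for _ in range(n_clusters)]
    points, labels = [], []
    for _ in range(n):
        c = rng.randrange(n_clusters)  # pick the cluster this point belongs to
        points.append([rng.gauss(mu, sd)
                       for mu, sd in zip(means[c], stdevs[c])])
        labels.append(c)
    return points, labels
```

For example, `simulate_gaussian_clusters(1000, 20, 5)` would correspond to the smallest configuration listed above (n = 1000, m = 20) with five clusters.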

Embodiment 2

[0084] Take the real data set as an example below to illustrate the effect of the method of the present invention:

[0085] The real data sets are all selected from the UCI database, and Table 1 describes the characteristics of all the data sets involved in the experiment. To simulate anomalies in a data set, we randomly select s ∈ [10, 100] points from the smallest class of each data set and mark them as the anomalous points of that data set; the remaining points are marked as normal points. Since the method of the present invention is not suitable for analyzing discrete attributes, the discrete attributes in some real data sets have to be removed. As in Embodiment 1, this embodiment uses the area under the ROC curve (AUC) to evaluate the effectiveness of the different methods.
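The labeling and evaluation protocol in [0085] can be sketched as below. This is a hedged illustration under stated assumptions: the helper names `label_anomalies` and `auc` are hypothetical, higher scores are assumed to mean "more anomalous", ties between equally small classes are broken arbitrarily, and the AUC is computed by direct pairwise comparison (the Mann-Whitney formulation), which is simple but quadratic in the number of points.

```python
import random
from collections import Counter


def label_anomalies(class_labels, s, seed=0):
    """Mark s randomly chosen points of the smallest class as anomalies (1).

    All other points are marked normal (0). Raises ValueError if the
    smallest class has fewer than s members.
    """
    rng = random.Random(seed)
    smallest = min(Counter(class_labels).items(), key=lambda kv: kv[1])[0]
    members = [i for i, c in enumerate(class_labels) if c == smallest]
    chosen = set(rng.sample(members, s))
    return [1 if i in chosen else 0 for i in range(len(class_labels))]


def auc(scores, is_anomaly):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (anomaly, normal) pairs in which the anomaly
    has the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, is_anomaly) if y == 1]
    neg = [s for s, y in zip(scores, is_anomaly) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal point")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every marked anomaly is scored above every normal point, and 0.5 corresponds to a random ranking.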

[0086] Table 1

[0087] Dataset name | Sample points | Number of attributes | Number of classes | Minimal class | large...


Abstract

The invention provides a large-scale data anomaly identification method based on bidirectional sampling combination. The method includes the following steps: performing crosswise sampling on a sample data set to obtain a sub-sample data set; performing attribute sampling on the sub-sample data set to obtain a stripe data set; scoring the degree of abnormality on the stripe data set; repeating the above steps; and combining the abnormality scores and calculating their expected value. Through bidirectional sampling, the method addresses both the large number of samples, with its high complexity, and the curse of dimensionality. Because the data set is partitioned by sampling, the scalability of the method is also improved.
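The steps listed in the abstract can be sketched as a small ensemble loop. This is only an illustration under assumptions, not the patented implementation: crosswise sampling is read here as sampling sample points (rows), the abstract does not say which scorer is applied to each stripe, so a mean absolute z-score stands in as a hypothetical base scorer, and the expected value is approximated by averaging each point's scores over the rounds in which it was sampled. All function and parameter names are invented for the example.

```python
import random
import statistics


def zscore_scores(stripe):
    """Hypothetical base scorer: mean absolute z-score of each stripe row."""
    columns = list(zip(*stripe))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        sum(abs(v - mu) / sd for v, mu, sd in zip(row, means, stdevs)) / len(row)
        for row in stripe
    ]


def bidirectional_sampling_scores(data, n_rows, n_attrs, n_rounds, seed=0):
    """Average per-point abnormality scores over repeated row/attribute samples.

    Returns one score per point, or None for points never sampled.
    """
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    totals = [0.0] * n
    counts = [0] * n
    for _ in range(n_rounds):
        row_idx = rng.sample(range(n), n_rows)    # sample the sample points
        attr_idx = rng.sample(range(m), n_attrs)  # sample the attributes
        stripe = [[data[i][j] for j in attr_idx] for i in row_idx]
        for i, score in zip(row_idx, zscore_scores(stripe)):
            totals[i] += score
            counts[i] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(200)]
    data[0] = [10.0] * 10  # plant one obvious anomaly at index 0
    scores = bidirectional_sampling_scores(data, n_rows=100, n_attrs=4, n_rounds=50)
    scored = [i for i, s in enumerate(scores) if s is not None]
    print("highest-scoring point:", max(scored, key=lambda i: scores[i]))
```

Each round touches only `n_rows * n_attrs` values, which is where the reduction in both sample size and dimensionality comes from; the number of rounds trades running time against the stability of the averaged score.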

Description

Technical field

[0001] The invention relates to an anomaly identification method, in particular to a large-scale data anomaly identification method based on bidirectional sampling combination.

Background art

[0002] Outlier detection is a method for detecting outlying sample points in a data set. Anomalies take many forms and may be noise, errors, or rare values. In the field of data mining, the generally accepted definition of an outlier is a point that is generated by a different mechanism and deviates from most observations. In this document, the opposite of an "outlier" is called an "inlier".

[0003] As an important research direction, anomaly identification is widely used in real-world applications such as credit card fraud detection, disease diagnosis and prevention, network intrusion detection, measurement error inspection, and rare value identification.

[0004] (1) Statistics-based anomaly identification methods

[0005] Since the 1980s, the prob...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/2453, G06F16/24564, G06F16/2457
Inventor: 张玉超, 邓波, 彭甫阳, 李海龙
Owner: 北京系统工程研究所