Double-window concept drift detection method based on sample distribution statistical test

A technology combining statistical testing and sample distribution analysis, applied in the field of machine learning. It addresses problems such as false drift detection, missed drift detection, and the increased update burden on the learner.

Active Publication Date: 2020-01-21
BEIJING UNIV OF TECH
10 Cites, 8 Cited by

AI Technical Summary

Problems solved by technology

[0018] To analyze whether the differences among the three test results are significant, the confidence levels of the three tests must be set in advance. The confidence level represents the acceptable error range of the hypothesis test. When the confidence level is too small, the distribution test is overly sensitive to concept changes: even a slight difference between samples makes the test hard to pass, which leads to false drift detections and increases the update burden on the learner. When the confidence level is too large, the distribution test is overly tolerant of concept changes, which leads to missed drifts and degrades prediction performance.
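The trade-off can be illustrated with a small sketch. The text does not name the three tests in this excerpt, so the two-sample Kolmogorov-Smirnov (KS) test is used here purely as an assumed example, and the function name `ks_threshold` is hypothetical. The sketch uses the standard asymptotic KS critical value with significance level alpha = 1 - confidence: a lower confidence level gives a lower rejection threshold (a more sensitive test), and a higher one gives a higher threshold (a more tolerant test).

```python
import math

def ks_threshold(confidence, n, m):
    """Asymptotic two-sample KS critical value for window sizes n and m.

    Assumption: alpha = 1 - confidence. The KS test is only one possible
    distribution test; the source does not specify which tests it uses.
    """
    alpha = 1.0 - confidence
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))

# Thresholds for two windows of 100 samples each (illustrative sizes).
for conf in (0.90, 0.95, 0.99):
    print(conf, round(ks_threshold(conf, 100, 100), 4))
```

A test statistic above the threshold would be read as a distribution change, so the threshold choice directly controls how often drift is flagged.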



Examples


Embodiment Construction

[0067] To verify the performance of this method, this paper selects the cement strength benchmark data set for testing. The data comes from the research group of Prof. I-Cheng Yeh and can be obtained from the UCI repository (https://archive.ics.uci.edu/). The data set contains a total of 1030 samples. The input variables are the main factors that directly or indirectly affect compressive strength: cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age. The output is the concrete compressive strength.

[0068] First, the data set is divided into two sub-data sets, which contain the first 500 and the last 500 samples of the original data set, respectively. Each sub-data set is then divided into five equal intervals of 100 samples each. In this paper, the first data in the first sub-dataset is used as the training set for model...
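The partitioning described above can be sketched with plain index arithmetic. This is a minimal illustration, not the patent's code: the function name `split_into_intervals` and its parameter names are hypothetical, and it works on sample indices only, so the data itself would still need to be loaded separately.

```python
def split_into_intervals(n_samples=1030, sub_size=500, interval_size=100):
    """Return index lists for the two sub-data sets, each cut into equal intervals.

    The first sub-data set covers the first `sub_size` samples and the second
    covers the last `sub_size` samples; samples in between are left unused.
    """
    def chunk(indices):
        return [indices[i:i + interval_size]
                for i in range(0, len(indices), interval_size)]

    first = list(range(0, sub_size))
    last = list(range(n_samples - sub_size, n_samples))
    return chunk(first), chunk(last)

first_intervals, last_intervals = split_into_intervals()
print(len(first_intervals), len(last_intervals))  # number of intervals per sub-data set
```

With 1030 samples, the 30 samples between the two sub-data sets (indices 500 to 529) fall into neither of them.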


Abstract

The invention discloses a double-window concept drift detection method based on sample distribution statistical tests, and belongs to the field of machine learning. To address concept drift in data streams over time, the method proceeds as follows: first, outlier detection is carried out in a fixed window using support vector regression (SVR); then, for the detected outliers, the Euclidean distance between new and old samples is calculated in a variable window, and statistical analysis based on this distance is performed with multiple distribution tests to indirectly reflect whether the data distribution has changed, and thus to determine whether drift has occurred; finally, the effectiveness of the method is verified on a cement strength benchmark data set and a municipal solid waste incineration (MSWI) outlet nitrogen oxide concentration data set.
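The distance-based testing step can be sketched as follows. This is a simplified illustration under several assumptions, not the patented procedure: the SVR outlier-detection stage is omitted, distances are measured to the centroid of the old window (the abstract does not say how new and old samples are paired), and a single two-sample Kolmogorov-Smirnov test with an asymptotic critical value stands in for the "multiple distribution tests". All function names are hypothetical.

```python
import math
from bisect import bisect_right

def centroid(window):
    """Component-wise mean of the samples in a window."""
    n = len(window)
    return [sum(col) / n for col in zip(*window)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def ks_critical(n, m, alpha=0.05):
    """Asymptotic KS critical value at significance level alpha."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))

def drift_detected(old_window, new_window, alpha=0.05):
    """Flag drift when the distance distributions differ significantly.

    Assumption: both windows are compared through their distances to the
    centroid of the old window.
    """
    c = centroid(old_window)
    d_old = [euclidean(s, c) for s in old_window]
    d_new = [euclidean(s, c) for s in new_window]
    return ks_statistic(d_old, d_new) > ks_critical(len(d_old), len(d_new), alpha)

# Deterministic toy data: 100 two-dimensional samples, then a shifted copy.
old = [[(i % 10) * 0.1, ((i * 7) % 10) * 0.1] for i in range(100)]
shifted = [[x + 10.0, y + 10.0] for x, y in old]
print(drift_detected(old, old), drift_detected(old, shifted))
```

Reducing each sample to a scalar distance is what lets a one-dimensional distribution test be applied to multivariate data, which appears to be the sense in which the abstract says the change is reflected "indirectly".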

Description

Technical field

[0001] The double-window concept drift detection method based on sample distribution statistical test belongs to the field of machine learning.

Background technique

[0002] At present, research in machine learning focuses mainly on non-incremental batch learning, in which the collected data is packed into data sets batch by batch and the base learner is trained on them in a centralized manner. As the volume of data grows massively, reading and processing data as traditional data sets increases storage cost. At the same time, centralized training makes the model lag behind the data, so it can neither reflect current working conditions in a timely manner nor respond reasonably to changes over time. An online learning algorithm instead updates the learner based on a single sample or a batch of samples, with the aim of obtaining a hypothesis based on all samples seen so far, which is better suited to practical problems. [...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06N20/00
CPC: G06N20/00, G06F18/2411, G06F18/2433
Inventors: 乔俊飞, 孙子健, 汤健
Owner: BEIJING UNIV OF TECH