View on GitHub

parallel-ADWIN

Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing

Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing

This repository contains a description and source code for Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing (Parallel Adwin).

Parallel Adwin at EDBT 2017

Parallel Adwin was first published at the 21th International Conference on Extending Database Technology (EDBT) in March 2018.

Abstract: Machine Learning (ML) techniques for data stream analysis suffer from concept drifts such as changed user preferences, varying weather conditions, or economic changes. These concept drifts cause wrong predictions and lead to incorrect business decisions. Concept drift detection methods such as adaptive windowing (Adwin) allow for adapting to concept drifts on the fly. In this paper, we examine Adwin in detail and point out its throughput bottlenecks. We then introduce several parallelization alternatives to address these bottlenecks. Our optimizations increase the throughput of the original Adwin approach by two orders of magnitude. Thus, we explore parallel adaptive windowing to provide scalable concept drift detection for high-velocity data streams with millions of tuples per second.

Publication:

How to run

Import the maven project into your favorite IDE, and run the MicroBenchmark.java as an application. You need to provide the following arguments in order to specify a variant of the Adwin algorithm and a type of data stream (see further below for an explanation of the arguments):

adwinType changeType batchSize numConstant numChange adwinDelta

This will perform a benchmark of the specified Adwin variant with the specified type of data stream. The two-step, JMH-based benchmark works as follows:

The resulting output to the console shows the processing time for every iteration as well as the number of Adwin cut checks performed in every iteration.

adwinType

This argument specifies the variant of Adwin to be benchmarked and may be any of ORIGINAL SERIAL HALFCUT SNAPSHOT.

changeType

This argument specifies the type of concept drift appearing in the data stream and may be any of CONSTANT ABRUPT INCREMENTAL GRADUAL OUTLIER.

For further details about the different types of concept drift, we refer to A Survey on Concept Drift Adaptation, pp. 6 by J. Gama et al. All data streams are periodic (see the following arguments).

batchSize

This argument specifies the number of tuples processed per iteration.

numConstant

This argument specifies the number of constant tuple values that are generated by a data stream between concept drift (or outlier) events. Therefore, it allows to set the rate of concept drifts in the stream. It is ignored if changeType is CONSTANT.

numChange

This argument specifies the number of tuples that an incremental or gradual concept drift lasts. Therefore it is only used if changeType is INCREMENTAL or GRADUAL.

adwinDelta

This argument specifies the confidence value delta which is an input parameter for the Adwin algorithm and determines the sensitivity of Adwin to (perceived) concept drifts.