Learning under concept drift: an overview
iTechs – ISCAS
p What’s Concept Drift
p Causes of a Concept Drift
p Types of Concept Drift
p Detecting and Handling Concept Drift
p Implications for Software Engineering Research
l Xt is a vector in p-dimensional feature space
observed at time t and yt is the corresponding label.
l We call Xt an instance and a pair (Xt, yt) a labeled
instance. We refer to instances (X1, ..., Xt) as
historical data and instance Xt+1 as target (or testing)
instance.
l The task is to predict a label yt+1 for the target
instance Xt+1.
p Concept Drift
l Every instance Xt is generated by a source St.
l If all the data is sampled from the same source, i.e. S1
= S2 = ... = St+1 = S, we say that the concept is stable.
l If Si != Sj for some two time points i and j, we say that
there is a concept drift.
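The definition above can be illustrated with a toy stream. This is a minimal sketch under invented assumptions: the source S labels an instance 1 when x > 0, the drifted source inverts that rule, and the names make_stream, error_rate and drift_at are hypothetical helpers, not part of the original slides.

```python
import random

def make_stream(n, drift_at=None, seed=0):
    """Yield (x, y) pairs; the source changes at index drift_at (if given)."""
    rng = random.Random(seed)
    for t in range(n):
        drifted = drift_at is not None and t >= drift_at
        x = rng.gauss(0.0, 1.0)
        # Source S: y = 1 when x > 0. Drifted source: the rule is inverted.
        y = int(x <= 0) if drifted else int(x > 0)
        yield x, y

def error_rate(stream, predict):
    """Fraction of instances in the stream that predict() labels wrongly."""
    pairs = list(stream)
    return sum(predict(x) != y for x, y in pairs) / len(pairs)

def rule(x):
    """A model that matches the original source S exactly."""
    return int(x > 0)

stable_err = error_rate(make_stream(1000), rule)               # S1 = ... = St
drift_err = error_rate(make_stream(1000, drift_at=500), rule)  # source changes at t = 500
```

While the concept is stable the fixed model makes no errors. Once the source changes, every prediction after the drift point is wrong, so the overall error rises to one half.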
Causes of Concept Drift
p Let X be an instance in p-dimensional
feature space. X belongs to a class ci, where c1, c2, ..., ck is the
set of class labels.
p The optimal classifier for X is
determined by the prior probabilities of the
classes P(ci) and the class-conditional
probability density functions p(X | ci), i = 1, ..., k.
p Concept / data source:
l a set of prior probabilities of the classes and class-
conditional probability density functions
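To make the role of the priors and class-conditional densities concrete, here is a minimal sketch of a Bayes classifier for a one-dimensional feature. The Gaussian class-conditional densities, the parameter values and the function names (gaussian_pdf, posteriors, bayes_classify) are assumptions chosen for illustration only.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x; stands in for p(X | ci)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, params):
    """Return P(ci | x) for every class via Bayes' rule."""
    joint = [p * gaussian_pdf(x, m, s) for p, (m, s) in zip(priors, params)]
    evidence = sum(joint)  # p(x), the normalizing constant
    return [j / evidence for j in joint]

def bayes_classify(x, priors, params):
    """Pick the class index with the largest posterior probability."""
    post = posteriors(x, priors, params)
    return max(range(len(post)), key=post.__getitem__)

# Assumed concept: two equally likely classes centred at -1 and +1.
priors = [0.5, 0.5]
params = [(-1.0, 1.0), (1.0, 1.0)]  # (mean, std) per class

label_left = bayes_classify(-2.0, priors, params)
label_right = bayes_classify(2.0, priors, params)
```

In this sketch the pair (priors, params) plays the role of the concept: changing either one changes the posteriors and can move the decision boundary.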
Causes of Concept Drift (cont.)
p Concept drift may occur in three ways:
l Class priors P(c) might change over time.
l The distributions of one or several classes p(X|ci)
might change. (virtual drift)
l The posterior distributions of the class memberships
p(ci|X) might change. (real drift)
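The difference between virtual and real drift can be sketched with a fixed model evaluated on three synthetic samples. Everything here is an assumed toy setup: x is drawn from a unit-variance Gaussian, the label is 1 when x exceeds a threshold, and sample and accuracy are hypothetical helper names.

```python
import random

def sample(n, x_mean, threshold, seed):
    """Draw n labeled instances: x ~ N(x_mean, 1), y = 1 if x > threshold."""
    rng = random.Random(seed)
    xs = [rng.gauss(x_mean, 1.0) for _ in range(n)]
    return [(x, int(x > threshold)) for x in xs]

def accuracy(data, threshold=0.0):
    """Accuracy of a fixed model that predicts 1 when x > threshold."""
    return sum(int(x > threshold) == y for x, y in data) / len(data)

def positive_share(data):
    """Empirical class prior of the positive class."""
    return sum(y for _, y in data) / len(data)

before = sample(2000, x_mean=0.0, threshold=0.0, seed=1)
# Inputs shift: class priors and p(X|ci) change, but p(ci|X) stays the same.
virtual = sample(2000, x_mean=2.0, threshold=0.0, seed=2)
# The labeling rule moves: p(ci|X) itself changes.
real = sample(2000, x_mean=0.0, threshold=1.0, seed=3)

acc_before = accuracy(before)
acc_virtual = accuracy(virtual)
acc_real = accuracy(real)
```

In this toy case the fixed model stays perfectly accurate under the input shift, even though the share of positive labels changes noticeably, and loses accuracy only when the labeling rule itself moves. With a noisy or imperfect model, a shift in the inputs can still hurt performance, so the sketch should not be read as saying that virtual drift is harmless.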
Types of Concept Drift
l Sudden drift
l Gradual drift
l Incremental drift
l Reoccurring contexts
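One way to picture the four types is as schedules that say which source generates the instance at each time step. The sketch below is illustrative only: sources are reduced to a single mean value, and the function names, the half-way switch point and the quarter-length periods are all assumptions.

```python
import random

def source_mean(t, n, kind, rng, old=0.0, new=5.0):
    """Mean of the source that generates the instance at time t."""
    if kind == "sudden":
        # Abrupt switch from the old source to the new one.
        return old if t < n // 2 else new
    if kind == "gradual":
        # Both sources stay active; the new one is sampled with rising probability.
        p_new = t / (n - 1)
        return new if rng.random() < p_new else old
    if kind == "incremental":
        # The source itself moves in small steps through intermediate concepts.
        return old + (new - old) * t / (n - 1)
    if kind == "reoccurring":
        # The old concept comes back after a period with the new one.
        return old if (t // (n // 4)) % 2 == 0 else new
    raise ValueError(f"unknown drift type: {kind}")

def schedule(kind, n=100, seed=0):
    """List of source means for time steps 0 .. n-1."""
    rng = random.Random(seed)
    return [source_mean(t, n, kind, rng) for t in range(n)]

sudden = schedule("sudden")
gradual = schedule("gradual")
incremental = schedule("incremental")
reoccurring = schedule("reoccurring")
```

Sudden and reoccurring schedules only ever take the two end values. A gradual schedule also uses only the two end values but mixes them over time, while an incremental schedule passes through the values in between.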
Detecting and Handling Concept Drift
l Monitoring the raw data
l Monitoring parameters of learners
l Monitoring prediction errors of learners
l Ensemble learning
l Instance selection
l Instance weights
l Training windows
l Training windows are naturally suitable for sudden concept
drift, while ensembles are more flexible in terms of change
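Two of the ideas listed above, monitoring prediction errors and training windows, can be combined in a small sketch. This is a deliberately simplified illustration and not any published detector: the model just predicts the majority label in its window, the monitor uses a fixed error-rate threshold, and all class names and parameter values are assumptions.

```python
from collections import deque

class WindowedMajority:
    """Predicts the majority label among the last `size` labeled instances."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def predict(self):
        if not self.window:
            return 0  # arbitrary default before any label is seen
        return int(sum(self.window) * 2 >= len(self.window))

    def update(self, y):
        self.window.append(y)

class ErrorMonitor:
    """Flags drift when the error rate over the last `size` predictions exceeds `threshold`."""
    def __init__(self, size=20, threshold=0.5):
        self.errors = deque(maxlen=size)
        self.threshold = threshold

    def add(self, wrong):
        self.errors.append(int(wrong))
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.threshold

def run(labels, window_size=50):
    """Test-then-train loop; returns the mistake count and detection times."""
    model, monitor = WindowedMajority(window_size), ErrorMonitor()
    mistakes, detections = 0, []
    for t, y in enumerate(labels):
        wrong = model.predict() != y  # predict first, then learn from the label
        mistakes += wrong
        if monitor.add(wrong):
            detections.append(t)
            model.window.clear()      # forget the outdated concept
            monitor.errors.clear()
        model.update(y)
    return mistakes, detections

labels = [0] * 200 + [1] * 200  # sudden drift at t = 200
mistakes, detections = run(labels)
```

In this toy run the monitor fires after 11 consecutive errors, at t = 210, and clearing the window lets the model recover immediately. Without the monitor the window would need 25 new labels before the majority flips, so the window size and the detection threshold both affect how quickly the learner adapts.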
Detecting and Handling Concept Drift (cont.)
p Overall solution for learning under concept drift
Implications for SE Research
p Concept drift is a fundamental issue for SE
l Cost estimation, defect prediction…
l Especially in the cross-company/cross-project context
l Can harm the performance of prediction models
p Detecting and handling concept drift is a
l Quality problems of SE data, e.g., insufficient data
l Data generation context is highly unstable.
p Has become an increasingly popular research
topic in the SE field!
l E.g., Burak Turhan [JESE 2012], Jayalath Ekanayake
[MSR 2009, JESE 2011]
References
1. Indre Zliobaite, “Learning under Concept Drift: an
Overview,” Tech-report, 2009
2. A. Dries and R. Ulrich, “Adaptive Concept Drift
Detection,” Journal of Statistical Analysis and Data
3. L. Minku, A. White, and X. Yao. “The impact of diversity
on on-line ensemble learning in the presence of concept
drift.” IEEE Transactions on Knowledge and Data
4. M. Kelly, D. Hand, and N. Adams. “The impact of
changing populations on classifier performance.”