Tracking Recurring Contexts using Ensemble Classifiers: An Application to Email Filtering

I. Katakis, G. Tsoumakas, I. Vlahavas, “Tracking Recurring Contexts using Ensemble Classifiers: An Application to Email Filtering”, Knowledge and Information Systems, Springer, 22(3), pp. 371-391, 2010.

Author(s): I. Katakis, Grigorios Tsoumakas, I. Vlahavas

Appeared In: Knowledge and Information Systems, Springer, 22(3), pp. 371-391, 2010.

Keywords: data streams, classification, concept drift, text mining, text classification, recurring contexts, recurring themes, text streams, email mining, email classification.

Abstract: Concept drift, which frequently appears in real-world stream classification problems, constitutes a challenging problem for the machine learning and data mining community. It is usually defined as the unforeseeable change of the concept underlying the target variable in a prediction task. In this paper, we focus on the problem of recurring contexts, a special subtype of concept drift that has not yet received proper attention from the research community. In the case of recurring contexts, concepts may reappear in the future, and thus older classification models might be beneficial for future classifications. We propose a general framework for classifying data streams by exploiting stream clustering in order to dynamically build and update an ensemble of incremental classifiers. To achieve this, a transformation function that maps batches of examples into a new conceptual representation model is proposed. The clustering algorithm is then applied in order to group batches of examples into concepts and identify recurring contexts. The ensemble is produced by creating and maintaining an incremental classifier for every concept discovered in the data stream. An experimental study is performed using (a) two new real-world concept-drifting datasets from the email domain, (b) an instantiation of the proposed framework, and (c) five methods for dealing with drifting concepts. Results indicate the effectiveness of the proposed representation and the suitability of the concept-specific classifiers for problems with recurring contexts.
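The pipeline outlined in the abstract (map each batch to a conceptual vector, cluster those vectors, keep one incremental classifier per cluster) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the mapping (per-class frequency of each binary feature), the clustering rule (nearest centroid under a distance threshold, otherwise a new cluster), the Bernoulli naive Bayes learner, and all names such as `conceptual_vector`, `IncrementalNB`, `ConceptEnsemble` and `threshold`.

```python
import math
from collections import defaultdict

CLASSES = (0, 1)  # assumed binary labels, e.g. 0 = legitimate, 1 = spam


def conceptual_vector(batch, n_features):
    """Map a batch of (features, label) pairs to a single vector.

    Assumed mapping: for every class and every binary feature, the fraction
    of that class's examples in the batch that have the feature set.
    """
    vec = []
    for c in CLASSES:
        rows = [x for x, y in batch if y == c]
        for i in range(n_features):
            vec.append(sum(x[i] for x in rows) / len(rows) if rows else 0.0)
    return vec


class IncrementalNB:
    """Tiny incremental Bernoulli naive Bayes with Laplace smoothing."""

    def __init__(self, n_features):
        self.class_count = defaultdict(int)
        self.on_count = defaultdict(lambda: [0] * n_features)

    def update(self, batch):
        for x, y in batch:
            self.class_count[y] += 1
            for i, v in enumerate(x):
                self.on_count[y][i] += v

    def predict(self, x):
        total = sum(self.class_count.values())
        best, best_score = None, -math.inf
        for c in CLASSES:
            score = math.log((self.class_count[c] + 1) / (total + len(CLASSES)))
            for i, v in enumerate(x):
                p_on = (self.on_count[c][i] + 1) / (self.class_count[c] + 2)
                score += math.log(p_on if v else 1.0 - p_on)
            if score > best_score:
                best, best_score = c, score
        return best


class ConceptEnsemble:
    """One incremental classifier per cluster of conceptual vectors."""

    def __init__(self, n_features, threshold):
        self.n = n_features
        self.threshold = threshold  # hypothetical distance cut-off
        self.centroids = []         # one centroid per discovered concept
        self.sizes = []             # batches merged into each centroid
        self.models = []            # one classifier per concept
        self.active = None          # concept currently used for prediction

    def predict(self, x):
        if self.active is None:
            return CLASSES[0]       # arbitrary default before any training
        return self.models[self.active].predict(x)

    def observe(self, batch):
        """Consume a labeled batch and return the index of its concept."""
        vec = conceptual_vector(batch, self.n)
        dists = [math.dist(vec, c) for c in self.centroids]
        if dists and min(dists) < self.threshold:
            k = dists.index(min(dists))      # recurring concept: reuse it
            self.sizes[k] += 1
            self.centroids[k] = [m + (v - m) / self.sizes[k]
                                 for m, v in zip(self.centroids[k], vec)]
        else:
            k = len(self.centroids)          # unseen concept: new classifier
            self.centroids.append(vec)
            self.sizes.append(1)
            self.models.append(IncrementalNB(self.n))
        self.models[k].update(batch)
        self.active = k
        return k


# Two toy concepts with opposite labelings of the same messages.
batch_a = [((1, 0), 1), ((1, 0), 1), ((0, 1), 0), ((0, 1), 0)]
batch_b = [((1, 0), 0), ((1, 0), 0), ((0, 1), 1), ((0, 1), 1)]
ens = ConceptEnsemble(n_features=2, threshold=1.0)
concepts = [ens.observe(b) for b in (batch_a, batch_b, batch_a)]
print(concepts)  # [0, 1, 0]: the third batch is routed back to concept 0
```

In this toy run the third batch falls within the threshold of the first centroid, so the classifier trained on the first concept is reused and updated rather than rebuilt, which is the behavior that makes concept-specific classifiers attractive when contexts recur. The threshold value and batch size would need tuning on real data.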