I. Katakis, G. Tsoumakas, I. Vlahavas, “On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams”, 10th Panhellenic Conference on Informatics (PCI 2005), P. Bozanis and E.N. Houstis (Eds.), Springer-Verlag, LNCS 3746, pp. 338-348, Volos, Greece, 11-13 November, 2005.
Author(s): I. Katakis, Grigorios Tsoumakas, I. Vlahavas
Keywords: Text Mining, Text Classification, Feature Based Classifiers, Dynamic Feature Space, Dynamic Feature Selection, Data Streams, Concept Drift.
Abstract: In this paper we argue that incrementally updating the features that a text classification algorithm considers is very important for real-world textual data streams, because in most applications the distribution of data and the description of the classification concept change over time. We propose coupling an incremental feature ranking method with an incremental learning algorithm that can consider different subsets of the feature vector during prediction (what we call a feature based classifier), in order to deal with this problem. Experimental results with a longitudinal database of real spam and legitimate emails show that our approach can adapt to the changing nature of streaming data and works much better than classical incremental learning algorithms.
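The coupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the ranking measure nor the learner, so the chi-squared score, the Bernoulli-style Naive Bayes model, the class name `FeatureBasedNB`, the parameter `k`, and the toy messages are all assumptions made for illustration. The point it shows is that counts are kept for every word seen so far, while each prediction uses only the words that currently rank highest.

```python
import math
from collections import defaultdict


class FeatureBasedNB:
    """Hypothetical sketch: incremental Naive Bayes over document counts,
    predicting with only the current top-k words by chi-squared score."""

    def __init__(self, k=2):
        self.k = k
        # class label -> number of documents seen with that label
        self.class_docs = defaultdict(int)
        # word -> class label -> number of documents containing the word
        self.word_docs = defaultdict(lambda: defaultdict(int))

    def update(self, words, label):
        """Incrementally absorb one labelled document."""
        self.class_docs[label] += 1
        for w in set(words):
            self.word_docs[w][label] += 1

    def chi2(self, word):
        """Maximum chi-squared score of a known word over all classes."""
        n = sum(self.class_docs.values())
        n_w = sum(self.word_docs[word].values())
        best = 0.0
        for label, n_c in self.class_docs.items():
            a = self.word_docs[word].get(label, 0)  # word present, this class
            b = n_w - a                             # word present, other classes
            c = n_c - a                             # word absent, this class
            d = n - n_w - c                         # word absent, other classes
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        return best

    def top_features(self):
        """Re-rank all known words; ties are broken alphabetically."""
        ranked = sorted(self.word_docs, key=lambda w: (-self.chi2(w), w))
        return ranked[: self.k]

    def predict(self, words):
        """Classify using only the currently selected feature subset."""
        if not self.class_docs:
            return None
        present = set(words)
        selected = self.top_features()
        n = sum(self.class_docs.values())
        scores = {}
        for label, n_c in self.class_docs.items():
            score = math.log(n_c / n)
            for w in selected:
                # Laplace-smoothed probability that w occurs in a document
                p = (self.word_docs[w].get(label, 0) + 1) / (n_c + 2)
                score += math.log(p if w in present else 1.0 - p)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy stream of (words, label) pairs, invented for illustration only.
stream = [
    (["cheap", "pills", "now"], "spam"),
    (["meeting", "agenda", "now"], "ham"),
    (["cheap", "offer", "now"], "spam"),
    (["project", "meeting", "now"], "ham"),
]

model = FeatureBasedNB(k=2)
for words, label in stream:
    model.update(words, label)

print(model.top_features())
print(model.predict(["cheap", "now"]))
```

Because the ranking is recomputed from the running counts, the selected subset can change as new messages arrive, without retraining the model from scratch. Re-ranking on every prediction is the simplest choice for a sketch; a practical system would likely cache or periodically refresh the ranking.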