Abstract
In this paper we argue that incrementally updating the features that a text classification algorithm considers is very important for real-world textual data streams, because in most applications both the distribution of the data and the description of the classification concept change over time. To deal with this problem, we propose coupling an incremental feature ranking method with an incremental learning algorithm that can consider different subsets of the feature vector during prediction (what we call a feature-based classifier). Experimental results on a longitudinal database of real spam and legitimate emails show that our approach can adapt to the changing nature of streaming data and performs much better than classical incremental learning algorithms.
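The coupling described in the abstract can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the choice of a chi-squared ranking, a Bernoulli naive Bayes model over word presence, the class name `IncrementalFeatureNB`, and the `top_k` parameter are all assumptions made for illustration. The idea shown is that training only updates counts for every word seen so far, while the subset of features actually used is re-ranked from those counts at prediction time.

```python
import math
from collections import defaultdict


class IncrementalFeatureNB:
    """Illustrative sketch (hypothetical names): naive Bayes over word presence
    that keeps counts for all words but predicts with the current top-k only."""

    def __init__(self, top_k=3):
        self.top_k = top_k  # assumed parameter: features used per prediction
        self.class_docs = defaultdict(int)  # label -> documents seen
        self.word_docs = defaultdict(lambda: defaultdict(int))  # word -> label -> documents

    def update(self, words, label):
        """Incremental learning step: only counts change, nothing is retrained."""
        self.class_docs[label] += 1
        for w in set(words):
            self.word_docs[w][label] += 1

    def _chi2(self, word, label):
        """Chi-squared score of `word` for `label`, computed from current counts."""
        n = sum(self.class_docs.values())
        with_word = self.word_docs[word]
        a = with_word.get(label, 0)        # in class, contains word
        b = sum(with_word.values()) - a    # other classes, contains word
        c = self.class_docs[label] - a     # in class, lacks word
        d = n - a - b - c                  # other classes, lacks word
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        return n * (a * d - c * b) ** 2 / denom if denom else 0.0

    def ranked_features(self):
        """Incremental feature ranking: current top-k words, ties broken alphabetically."""
        def score(w):
            return max(self._chi2(w, c) for c in self.class_docs)
        return sorted(self.word_docs, key=lambda w: (-score(w), w))[: self.top_k]

    def predict(self, words):
        """Classify using only the features that rank highest right now."""
        present = set(words)
        features = self.ranked_features()
        n = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_c in self.class_docs.items():
            score = math.log(n_c / n)
            for w in features:
                # Laplace-smoothed probability that a document of this class contains w
                p = (self.word_docs[w].get(label, 0) + 1) / (n_c + 2)
                score += math.log(p if w in present else 1 - p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = IncrementalFeatureNB(top_k=3)
for _ in range(3):
    clf.update(["cheap", "pills", "hello"], "spam")
    clf.update(["meeting", "agenda", "hello"], "ham")
features_before = clf.ranked_features()

# Simulated drift: spam starts using a different vocabulary.
for _ in range(6):
    clf.update(["lottery", "winner", "hello"], "spam")
features_after = clf.ranked_features()
```

In this toy stream the word "lottery" is absent from the selected features before the simulated drift and enters them afterwards, without any retraining. A classifier with a feature set fixed at the start would have no way to make use of it.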
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Katakis, I., Tsoumakas, G., Vlahavas, I. (2005). On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams. In: Bozanis, P., Houstis, E.N. (eds) Advances in Informatics. PCI 2005. Lecture Notes in Computer Science, vol 3746. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573036_32
DOI: https://doi.org/10.1007/11573036_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29673-7
Online ISBN: 978-3-540-32091-3
eBook Packages: Computer Science, Computer Science (R0)