On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams

Katakis, Ioannis; Tsoumakas, Grigorios; Vlahavas, Ioannis

doi:10.1007/11573036_32

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3746)

Included in the following conference series:

Panhellenic Conference on Informatics

Abstract

In this paper we argue that incrementally updating the features that a text classification algorithm considers is very important for real-world textual data streams, because in most applications both the distribution of the data and the description of the classification concept change over time. To deal with this problem, we propose coupling an incremental feature ranking method with an incremental learning algorithm that can consider different subsets of the feature vector during prediction (what we call a feature-based classifier). Experimental results on a longitudinal database of real spam and legitimate emails show that our approach can adapt to the changing nature of streaming data and works much better than classical incremental learning algorithms.

Author information

Authors and Affiliations

Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece
Ioannis Katakis, Grigorios Tsoumakas & Ioannis Vlahavas

Editor information

Editors and Affiliations

Department of Computer and Communication Engineering, University of Thessaly, Glavani 37, 382 21, Volos, Greece
Panayiotis Bozanis
Department of Computer and Communication Engineering, University of Thessaly, 382 21, Volos, Greece
Elias N. Houstis

Cite this paper

Katakis, I., Tsoumakas, G., Vlahavas, I. (2005). On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams. In: Bozanis, P., Houstis, E.N. (eds) Advances in Informatics. PCI 2005. Lecture Notes in Computer Science, vol 3746. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573036_32

DOI: https://doi.org/10.1007/11573036_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29673-7
Online ISBN: 978-3-540-32091-3
eBook Packages: Computer Science; Computer Science (R0)
