On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams

  • Conference paper
Advances in Informatics (PCI 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3746)

Abstract

In this paper we argue that incrementally updating the features that a text classification algorithm considers is very important for real-world textual data streams, because in most applications the distribution of data and the description of the classification concept change over time. To deal with this problem, we propose coupling an incremental feature ranking method with an incremental learning algorithm that can consider different subsets of the feature vector during prediction (what we call a feature-based classifier). Experimental results with a longitudinal database of real spam and legitimate emails show that our approach can adapt to the changing nature of streaming data and works much better than classical incremental learning algorithms.
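The coupling described in the abstract can be illustrated with a minimal sketch. The abstract does not name the ranking measure or the learner, so the choices here are assumptions made only for illustration: chi-squared as the incremental ranking score and a Bernoulli-style naive Bayes as the incremental learner. The class name `FeatureBasedNaiveBayes`, the parameter `k`, and the toy messages are hypothetical. The point shown is that counts are kept for every word seen so far, while the subset of words actually used is chosen again at each prediction.

```python
import math
from collections import Counter, defaultdict


class FeatureBasedNaiveBayes:
    """Illustrative sketch (not the paper's implementation): an incremental
    naive Bayes that stores counts for all words and, at prediction time,
    uses only the k words currently ranked highest by chi-squared."""

    def __init__(self, k):
        self.k = k                             # number of features used per prediction
        self.class_docs = Counter()            # label -> documents seen
        self.word_docs = defaultdict(Counter)  # word -> label -> documents containing it

    def update(self, words, label):
        """Incremental step: one labelled document updates the counts."""
        self.class_docs[label] += 1
        for word in set(words):
            self.word_docs[word][label] += 1

    def _chi2(self, word, label):
        """Chi-squared of the 2x2 table (word present/absent vs. label/other)."""
        n = sum(self.class_docs.values())
        counts = self.word_docs[word]
        a = counts[label]                  # in class, word present
        b = sum(counts.values()) - a       # other classes, word present
        c = self.class_docs[label] - a     # in class, word absent
        d = n - a - b - c                  # other classes, word absent
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return n * (a * d - b * c) ** 2 / denom if denom else 0.0

    def top_features(self):
        """Rank all known words from the current counts; ties break alphabetically."""
        scores = {w: max(self._chi2(w, c) for c in self.class_docs)
                  for w in self.word_docs}
        return sorted(scores, key=lambda w: (-scores[w], w))[: self.k]

    def predict(self, words):
        """Score each class using only the currently selected feature subset."""
        present = set(words)
        features = self.top_features()
        n = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_docs.items():
            score = math.log(n_label / n)
            for f in features:
                # Laplace-smoothed probability that f occurs in a document of this class
                p = (self.word_docs[f][label] + 1) / (n_label + 2)
                score += math.log(p if f in present else 1 - p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy stream of (words, label) pairs.
stream = [
    (["cheap", "pills", "now"], "spam"),
    (["meeting", "agenda", "now"], "ham"),
    (["cheap", "pills", "offer"], "spam"),
    (["project", "meeting", "now"], "ham"),
]
clf = FeatureBasedNaiveBayes(k=3)
for words, label in stream:
    clf.update(words, label)

selected = clf.top_features()
prediction = clf.predict(["cheap", "pills", "today"])
```

Because the ranking is recomputed from the running counts, a word that only starts to matter later in the stream can enter the selected subset without retraining from scratch. A classifier with a feature set fixed at the start would ignore such a word.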

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Katakis, I., Tsoumakas, G., Vlahavas, I. (2005). On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams. In: Bozanis, P., Houstis, E.N. (eds) Advances in Informatics. PCI 2005. Lecture Notes in Computer Science, vol 3746. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573036_32

  • DOI: https://doi.org/10.1007/11573036_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29673-7

  • Online ISBN: 978-3-540-32091-3

  • eBook Packages: Computer Science (R0)
