
Experimentally Studying Progressive Filtering in Presence of Input Imbalance

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2010)

Abstract

Progressive Filtering (PF) is a simple categorization technique framed within the local classifier per node approach. In PF, each classifier is entrusted with deciding whether or not the input at hand can be forwarded to its children. A simple way to implement PF consists of unfolding the given taxonomy into pipelines of classifiers, so that each node of a pipeline is a binary classifier able to recognize whether or not an input belongs to the corresponding class. In this chapter, we illustrate and discuss the results obtained by assessing the PF technique on a text categorization task. The experiments, run on the Reuters Corpus (RCV1-v2) dataset, focus on the ability of PF to deal with input imbalance. In particular, the assessment consists of: (i) comparing the results with those obtained by the corresponding flat approach; (ii) calculating the improvement in performance as the pipeline depth increases; and (iii) measuring the performance in terms of generalization error, specialization error, misclassification error, and unknown ratio. Experimental results show that, for the adopted dataset, PF is able to counteract large imbalances between negative and positive examples. We also present and discuss further experiments aimed at assessing TSA, the greedy threshold selection algorithm adopted to perform PF, against a relaxed brute-force algorithm and the most relevant state-of-the-art algorithms.
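The pipeline idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy taxonomy, the keyword-counting classifiers, their thresholds, and the choice to treat each root-to-leaf path as one pipeline are all assumptions made here for illustration, and every function name is hypothetical.

```python
from typing import Callable, Dict, List, Set

Taxonomy = Dict[str, List[str]]            # parent category -> child categories
BinaryClassifier = Callable[[str], bool]   # True if the document is accepted


def unfold(taxonomy: Taxonomy, root: str) -> List[List[str]]:
    """Unfold a taxonomy into pipelines (assumed here: root-to-leaf paths)."""
    pipelines: List[List[str]] = []

    def visit(node: str, prefix: List[str]) -> None:
        path = prefix + [node]
        children = taxonomy.get(node, [])
        if not children:                   # leaf reached: one complete pipeline
            pipelines.append(path)
        for child in children:
            visit(child, path)

    visit(root, [])
    return pipelines


def keyword_classifier(keywords: Set[str], threshold: int) -> BinaryClassifier:
    """Hypothetical stand-in for a trained binary classifier with a threshold."""
    def accept(doc: str) -> bool:
        tokens = set(doc.lower().split())
        return len(tokens & keywords) >= threshold
    return accept


def passes(pipeline: List[str],
           classifiers: Dict[str, BinaryClassifier],
           doc: str) -> bool:
    """Forward the document down the pipeline; stop at the first rejection."""
    for category in pipeline:
        if not classifiers[category](doc):
            return False
    return True


def categorize(doc: str,
               pipelines: List[List[str]],
               classifiers: Dict[str, BinaryClassifier]) -> List[List[str]]:
    """Return every pipeline whose classifiers all accept the document."""
    return [p for p in pipelines if passes(p, classifiers, doc)]


# Toy data, invented for this example only.
TAXONOMY: Taxonomy = {"news": ["sports", "economy"], "sports": ["football"]}
CLASSIFIERS: Dict[str, BinaryClassifier] = {
    "news": keyword_classifier({"news"}, 1),
    "sports": keyword_classifier({"sports", "match"}, 1),
    "football": keyword_classifier({"football"}, 1),
    "economy": keyword_classifier({"market", "bank"}, 1),
}
PIPELINES = unfold(TAXONOMY, "news")
DOC = "football match report in the sports news"
ACCEPTED = categorize(DOC, PIPELINES, CLASSIFIERS)
```

In this sketch a document is rejected as soon as one classifier along the path declines it, so deeper classifiers only see inputs that their ancestors have already accepted. The per-classifier thresholds are fixed by hand here; selecting them is the role the abstract assigns to TSA, which this sketch does not attempt to reproduce.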




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Addis, A., Armano, G., Vargiu, E. (2013). Experimentally Studying Progressive Filtering in Presence of Input Imbalance. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2010. Communications in Computer and Information Science, vol 272. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29764-9_4


  • DOI: https://doi.org/10.1007/978-3-642-29764-9_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29763-2

  • Online ISBN: 978-3-642-29764-9

  • eBook Packages: Computer Science (R0)
