Part of the book series: Advances in Intelligent and Soft Computing ((AINSC,volume 93))

Abstract

Class imbalance is a well-known problem in many practical applications of machine learning, and it has a marked effect on the performance of standard classifiers. In this paper we investigate whether classifying Medline documents using the MeSH controlled vocabulary poses additional challenges under class-imbalanced prediction. For this task, we evaluate the performance of Bayesian networks combined with several available strategies for overcoming the effect of class imbalance. Our results show both that Bayesian network classifiers are sensitive to class imbalance and that existing techniques can improve their overall performance.
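The abstract does not say which imbalance strategies were evaluated. As an illustration only, the sketch below shows random undersampling, one widely used strategy of this kind; it is an assumption and not necessarily the authors' method. The function name `random_undersample`, the 95/5 class split, and the seed are hypothetical; the "relevant"/"irrelevant" labels are taken from the paper title.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted indices that keep every example of the smallest class
    and an equally sized random subset of each larger class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is reduced to the size of the smallest class.
    target = min(len(idxs) for idxs in by_class.values())

    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    return sorted(kept)


# Made-up toy data: 95 irrelevant documents and 5 relevant ones.
labels = ["irrelevant"] * 95 + ["relevant"] * 5
kept = random_undersample(labels)
balanced = Counter(labels[i] for i in kept)
print(balanced)
```

Undersampling discards majority-class examples, so it trades training data for balance; oversampling the minority class or cost-sensitive learning are alternatives with different trade-offs.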

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pavón, R., Laza, R., Reboiro-Jato, M., Fdez-Riverola, F. (2011). Assessing the Impact of Class-Imbalanced Data for Classifying Relevant/Irrelevant Medline Documents. In: Rocha, M.P., Rodríguez, J.M.C., Fdez-Riverola, F., Valencia, A. (eds) 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011). Advances in Intelligent and Soft Computing, vol 93. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19914-1_45

  • DOI: https://doi.org/10.1007/978-3-642-19914-1_45

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19913-4

  • Online ISBN: 978-3-642-19914-1

  • eBook Packages: Engineering, Engineering (R0)
