Abstract
Data skewness is a common challenge when applying supervised machine learning in domains such as healthcare and biomedical information engineering. Evidence Based Medicine (EBM) is a clinical strategy for prescribing treatment for individual patients based on the current best evidence. To support their decision-making, clinicians need to query publication repositories for that evidence. This information is materialised as scientific artefacts in scholarly publications, and extracting these artefacts automatically is a technical challenge for current generic search engines. Many classification approaches have been proposed for identifying key scientific artefacts in EBM; however, their performance is affected by the imbalanced nature of the data in this domain. In this paper, we present four data balancing approaches applied in a binary ensemble classifier framework for classifying scientific artefacts in the EBM domain. Our balancing approaches improve the ensemble classifier’s F-score by up to 15% for classes of scientific artefacts with extremely low coverage in the domain. In addition, we propose a classifier selection method that chooses the best classifier based on the distributional features of the classes. The resulting classifiers show improved classification performance compared to state-of-the-art approaches.
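The abstract does not name the four balancing approaches, so the following is only a generic illustration of the idea, not the authors' method. It sketches random under-sampling of the majority class inside a one-vs-rest (binary) view of a multi-class sentence dataset. The function names `binarize` and `random_undersample`, the `ratio` parameter, and the toy labels are all assumptions made for this example.

```python
import random
from collections import Counter


def binarize(labels, target):
    """One-vs-rest view: 1 if the item carries the target class, else 0."""
    return [1 if y == target else 0 for y in labels]


def random_undersample(samples, binary_labels, ratio=1.0, seed=0):
    """Keep every positive and at most ratio * n_pos randomly chosen negatives.

    Hypothetical helper: plain random under-sampling, used here only to
    illustrate what "balancing" a skewed binary task can look like.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [binary_labels[i] for i in kept]


# Made-up toy data: one class with very low coverage.
labels = ["common_a"] * 50 + ["common_b"] * 30 + ["common_c"] * 15 + ["rare"] * 5
sentences = [f"sentence {i}" for i in range(len(labels))]

y = binarize(labels, "rare")
X_bal, y_bal = random_undersample(sentences, y, ratio=1.0)

print("before:", Counter(y))
print("after: ", Counter(y_bal))
```

In an ensemble of binary classifiers, a step like this would be applied separately for each target class before training that class's classifier, so that a rare class is not swamped by negatives.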
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Hassanzadeh, H., Groza, T., Nguyen, A., Hunter, J. (2014). Load Balancing for Imbalanced Data Sets: Classifying Scientific Artefacts for Evidence Based Medicine. In: Pham, DN., Park, SB. (eds) PRICAI 2014: Trends in Artificial Intelligence. PRICAI 2014. Lecture Notes in Computer Science, vol 8862. Springer, Cham. https://doi.org/10.1007/978-3-319-13560-1_84
DOI: https://doi.org/10.1007/978-3-319-13560-1_84
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13559-5
Online ISBN: 978-3-319-13560-1
eBook Packages: Computer Science (R0)