Abstract
Classification of unbalanced datasets is a critical task that is getting interest due to its relevance in many contexts and especially in the industrial one where machine faults, quality deviations belong to the class of rare events whose identification is fundamental. This work introduces and outlines the main themes related to this problem including an analysis of the factors that make the detection of unfrequent events complicated, a list of the metrics used for classifiers assessment and a review of most popular and emerging approaches used for facing class unbalance with a special focus on the detection of rare events.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Batista, G., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. Newsl. 6(1), 20–29 (2004)
Borselli, A., Colla, V., Vannucci, M., Veroli, M.: A fuzzy inference system applied to defect detection in flat steel production (2010)
Cateni, S., Colla, V., Vannucci, M.: A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 135, 32–41 (2014)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: SMOTEBoost: improving prediction of the minority class in boosting. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119. Springer, Heidelberg (2003)
Chawla, N.: C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: Proceedings of ICML03 Workshop on Class Imbalances (2003)
Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)
Fan, W., Stolfo, S.J., Zhang, J., Chan, P.K.: Adacost: misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning, ICML 1999, pp. 97–105. Morgan Kaufmann Publishers Inc., San Francisco (1999)
García-Pedrajas, N., Ortiz-Boyer, D., García-Pedrajas, M.D., Fyfe, C.: Class imbalance methods for translation initiation site recognition. In: García-Pedrajas, N., Herrera, F., Fyfe, C., Benítez, J.M., Ali, M. (eds.) IEA/AIE 2010, Part I. LNCS, vol. 6096, pp. 327–336. Springer, Heidelberg (2010)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Japkowicz, N.: The class imbalance problem: significance and strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence, ICAI, pp. 111–117 (2000)
Japkowicz, N.: Concept-learning in the presence of \( Between-Class\) and \( Within-Class\) imbalances. In: Stroulia, E., Matwin, S. (eds.) Canadian AI 2001. LNCS (LNAI), vol. 2056, pp. 67–77. Springer, Heidelberg (2001)
Japkowicz, N., Myers, C., Gluck, M.: A novelty detection approach to classification. In: Proceedings of 14th International Joint Conference on Artificial Intelligence, pp. 518–523 (1995)
Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Anal. 6(5), 429–449 (2002)
Joshi, M., Kumar, V., Agarwal, R.: Evaluating boosting algorithms to classify rare classes: comparison and improvements, pp. 257–264 (2001)
Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186. Morgan Kaufmann (1997)
Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds.) AIME 2001. LNCS (LNAI), vol. 2101, pp. 63–66. Springer, Heidelberg (2001)
Ling, C.X., Yang, Q., Wang, J., Zhang, S.: Decision trees with minimal costs. In: Proceedings of the Twenty-first International Conference on Machine Learning, ICML 2004, p. 69. ACM, New York (2004)
Liu, Y., Chawla, N., Harper, M., Shriberg, E., Stolcke, A.: A study in machine learning from imbalanced data for sentence boundary detection in speech. Comput. Speech Lang. 20(4), 468–494 (2006)
Maheta, H.H., Dabhi, V.K.: Classification of imbalanced data sets using multi objective genetic programming. In: 2015 International Conference on Computer Communication and Informatics (ICCCI), pp. 1–6, January 2015
Schölkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L.: New support vector algorithms. Neural Comput. 12(5), 1207–1245 (2000)
Soda, P.: A multi-objective optimisation approach for class imbalance learning. Pattern Recogn. 44(8), 1801–1810 (2011)
Vannucci, M., Colla, V.: Novel classification method for sensitive problems and uneven datasets based on neural networks and fuzzy logic. Appl. Soft Comput. J. 11(2), 2383–2390 (2011)
Vannucci, M., Colla, V., Nastasi, G., Matarese, N.: Detection of rare events within industrial datasets by means of data resampling and specific algorithms. Int. J. Simul. Syst. Sci. Technol. 11(3), 1–11 (2010)
Vannucci, M., Colla, V., Sgarbi, M., Toscanelli, O.: Thresholded neural networks for sensitive industrial classification tasks. In: Cabestany, J., Sandoval, F., Prieto, A., Corchado, J.M. (eds.) IWANN 2009. LNCS, vol. 5517, pp. 1320–1327. Springer, Heidelberg (2009)
Vannucci, M., Colla, V., Vannocci, M., Reyneri, L.: Dynamic resampling method for classification of sensitive problems and uneven datasets. In: Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R. (eds.) IPMU 2012. CCIS, vol. 298, pp. 78–87. Springer, Heidelberg (2012)
Vannucci, M., Colla, V.: Smart under-Sampling for the detection of rare patterns in unbalanced datasets. Springer International Publishing, Cham (2016)
Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. In: IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2009, pp. 324–331, March 2009
Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explor. Newsl. 6(1), 7–19 (2004)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Vannucci, M., Colla, V. (2016). Classification of Unbalanced Datasets and Detection of Rare Events in Industry: Issues and Solutions. In: Jayne, C., Iliadis, L. (eds) Engineering Applications of Neural Networks. EANN 2016. Communications in Computer and Information Science, vol 629. Springer, Cham. https://doi.org/10.1007/978-3-319-44188-7_26
Download citation
DOI: https://doi.org/10.1007/978-3-319-44188-7_26
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44187-0
Online ISBN: 978-3-319-44188-7
eBook Packages: Computer ScienceComputer Science (R0)