Abstract
Learning from imbalanced data, where one class has significantly more observations than the other, has gained considerable attention in the machine learning community. Assuming the difficulty in predicting each class is similar, most standard classifiers will tend to predict the majority class well. This study uses tornado data, which are highly imbalanced because tornadoes are rare events. In the severe weather data used herein, thunderstorm circulations (mesocyclones) produce tornadoes in approximately 6.7% of the observations. However, since tornadoes are high-impact weather events, it is important to predict the minority class with high accuracy. In this study, we apply support vector machines (SVMs) and logistic regression, each with and without a midpoint threshold adjustment on the probabilistic outputs, as well as random forest and rotation forest, to tornado prediction. Feature selection with SVM-recursive feature elimination was also performed to identify the most important features or variables for predicting tornadoes. The results showed that SVMs with the threshold adjustment performed better than the other classifiers.
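To illustrate why adjusting the decision threshold on probabilistic outputs can help with a rare class, the following is a minimal sketch and not the authors' implementation. The abstract does not define the midpoint rule, so the sketch assumes the threshold is placed halfway between the mean predicted probability of each class. The function names, the toy labels, and the toy probabilities are all hypothetical.

```python
from statistics import mean


def midpoint_threshold(probs, labels):
    """Assumed rule: midpoint between the mean predicted probability
    of the positive class and that of the negative class."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    return (mean(pos) + mean(neg)) / 2.0


def confusion(probs, labels, threshold):
    """Return (hits, misses, false_alarms, correct_nulls) at a threshold."""
    hits = misses = false_alarms = correct_nulls = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            hits += 1
        elif y == 1:
            misses += 1
        elif pred == 1:
            false_alarms += 1
        else:
            correct_nulls += 1
    return hits, misses, false_alarms, correct_nulls


def detection_rate(hits, misses):
    """Fraction of actual positives that were predicted positive."""
    return hits / (hits + misses)


# Hypothetical toy data: 3 rare-class cases among 12 observations.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.60, 0.40, 0.30, 0.35, 0.20, 0.10,
         0.10, 0.05, 0.05, 0.05, 0.05, 0.05]

default_counts = confusion(probs, labels, 0.5)
adjusted = midpoint_threshold(probs, labels)
adjusted_counts = confusion(probs, labels, adjusted)

print("default threshold 0.5:", default_counts,
      "detection rate", detection_rate(*default_counts[:2]))
print("adjusted threshold", round(adjusted, 3), ":", adjusted_counts,
      "detection rate", detection_rate(*adjusted_counts[:2]))
```

On this toy set, lowering the threshold recovers the rare-class cases that the default 0.5 cut-off misses, at the cost of one additional false alarm. Whether that trade is acceptable depends on the relative cost of misses and false alarms.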





Acknowledgments
Funding for this research was provided under the National Science Foundation Grants AGS0831359 and EIA-0205628.
Cite this article
Trafalis, T.B., Adrianto, I., Richman, M.B. et al. Machine-learning classifiers for imbalanced tornado data. Comput Manag Sci 11, 403–418 (2014). https://doi.org/10.1007/s10287-013-0174-6
DOI: https://doi.org/10.1007/s10287-013-0174-6
Keywords
- Machine learning
- Support vector machines
- Random forest
- Rotation forest
- Logistic regression
- Tornado detection