ABSTRACT
Learning from imbalanced data remains an active research area. Techniques for handling imbalanced data can substantially improve the predictive performance of machine learning and statistical models and mitigate bias, leading to more reliable results. Data used to predict the fatality rate of car accidents are drawn from various sources, including person-level, vehicle-level, and collision-level information. These data are typically imbalanced, and studying them is valuable for improving road safety. Predicting fatal events is also crucial for better management and allocation of limited health resources. This study explores the impact of imbalanced data handling techniques on linear statistical models and illustrates the significant improvement in specificity when imbalanced data are appropriately managed. The findings provide guidelines for health resource management, illuminate the influence of data imbalance on prediction accuracy, and offer insights for improving the prediction of auto collision fatalities.
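As a rough illustration of the ideas in the abstract, the sketch below shows one common imbalance-handling technique, random oversampling, together with the sensitivity and specificity calculation. The abstract does not say which techniques the study uses, so the choice of oversampling, the function names random_oversample and sensitivity_specificity, the toy data, and the coding of the fatal class as label 1 are all assumptions made for illustration only.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class in the original data; duplicates are drawn from here.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.
    Which class counts as positive is an assumption set by the caller."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


if __name__ == "__main__":
    # Toy data: 8 non-fatal (0) and 2 fatal (1) records, one feature each.
    rows = [[i] for i in range(10)]
    labels = [0] * 8 + [1] * 2

    balanced_rows, balanced_labels = random_oversample(rows, labels, seed=1)
    print(Counter(balanced_labels))  # both classes now have 8 rows

    # A classifier that always predicts the majority class looks accurate
    # on imbalanced data but never detects the minority class.
    always_majority = [0] * len(labels)
    print(sensitivity_specificity(labels, always_majority))  # (0.0, 1.0)
```

In practice, resampling of this kind would be applied to the training split only, so that the evaluation data keep their original class proportions.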
Index Terms
- Handling Data Imbalance In Linear Modelling of Fatality Rate of Auto Collision