Advantages of Oversampling Techniques: A Case Study in Risk Factors for Fall Prediction

Sihag, Gulshan; Yadav, Pankaj; Vijay, Vivek; Delcroix, Veronique; Siebert, Xavier; Yadav, Sandeep Kumar; Puisieux, François

doi:10.1007/978-3-031-37496-8_4

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1856)

Included in the following conference series:

International Conference on Information and Communication Technologies for Ageing Well and e-Health

Abstract

The evaluation of risk factors for falls (RFF) is a key point in fall prevention for the elderly. Since the main actionable RFF cannot always be re-evaluated regularly by medical staff, their automatic prediction would make it possible to provide useful recommendations to reduce the risk of falls. This article explores the advantages of three oversampling methods for improving the prediction of 12 target RFF on a real imbalanced data set. We first present the data set, together with the selection of 45 variables and 12 target variables and other pre-processing steps. Second, we present the three oversampling methods (SMOTE, SVM-SMOTE, and ADASYN), the classifiers (Logistic Regression, Random Forest, Bayesian Network, Artificial Neural Network, and Naive Bayes), and the quality measures used in this study (balanced accuracy, area under the ROC curve, area under the Precision-Recall curve, and F1 and F2 scores). Each target is predicted in turn from all other variables. Results are presented by classifier (averaging over targets) and by target (averaging over classifiers), for each oversampling method and quality measure. Finally, statistical tests confirm the benefit of using oversampling methods. All three methods show a clear advantage over the original imbalanced data set, and SVM-SMOTE provides the largest improvement.
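As a rough illustration of the ideas named in the abstract, the sketch below shows the core interpolation step of SMOTE and two of the listed quality measures (balanced accuracy and the F-beta score, which gives F1 and F2). It is not the authors' implementation: the function names, the toy data, and the confusion-matrix counts are hypothetical, the features are assumed to be numeric, and the base point is drawn at random rather than by iterating over every minority sample as in the original SMOTE description.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points (simplified SMOTE).

    Each new point lies on the segment between a randomly chosen minority
    sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (recall on positives) and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical toy data: four minority points in two dimensions.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6, k=2)

# Hypothetical confusion-matrix counts for an imbalanced test set.
ba = balanced_accuracy(tp=8, fp=10, tn=80, fn=2)
f1 = f_beta(tp=8, fp=10, fn=2, beta=1.0)
f2 = f_beta(tp=8, fp=10, fn=2, beta=2.0)
```

Because every synthetic point is a convex combination of two existing minority samples, oversampling of this kind fills in the minority region without copying samples verbatim. SVM-SMOTE and ADASYN differ mainly in which minority samples they choose as starting points, which this sketch does not model.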

Author information

Authors and Affiliations

Univ. Polytechnique Hauts-de-France, CNRS, UMR 8201 - LAMIH, 59313, Valenciennes, France
Gulshan Sihag & Veronique Delcroix
Department of Mathematics, Indian Institute of Technology, Jodhpur, Jodhpur, India
Pankaj Yadav, Vivek Vijay & Sandeep Kumar Yadav
Faculté polytechnique Département de Mathématique et Recherche Opérationnelle, Univ. de Mons, Mons, Belgium
Xavier Siebert
Dépt. de Gérontologie, Hôpital Universitaire de Lille, 59037, Lille Cedex, France
François Puisieux

Corresponding author

Correspondence to Gulshan Sihag.

Editor information

Editors and Affiliations

Wrocław University of Economics, Institute of Business Informatics, and Macquarie University, Wrocław, Poland
Leszek A. Maciaszek
University of Ulster, Newtownabbey, UK
Maurice D. Mulvenna
RWTH Aachen University, Aachen, Germany
Martina Ziefle

Cite this paper

Sihag, G. et al. (2023). Advantages of Oversampling Techniques: A Case Study in Risk Factors for Fall Prediction. In: Maciaszek, L.A., Mulvenna, M.D., Ziefle, M. (eds) Information and Communication Technologies for Ageing Well and e-Health. ICT4AWE 2021, ICT4AWE 2022. Communications in Computer and Information Science, vol 1856. Springer, Cham. https://doi.org/10.1007/978-3-031-37496-8_4

DOI: https://doi.org/10.1007/978-3-031-37496-8_4
Published: 14 July 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37495-1
Online ISBN: 978-3-031-37496-8
eBook Packages: Computer Science, Computer Science (R0)
