Data preprocessing techniques: emergence and selection towards machine learning models - a practical review using HPA dataset

Mallikharjuna Rao, K; Saikrishna, Ghanta; Supriya, Kundrapu

doi:10.1007/s11042-023-15087-5

Published: 18 March 2023

Volume 82, pages 37177–37196, (2023)

Multimedia Tools and Applications

K Mallikharjuna Rao¹,
Ghanta Saikrishna¹ &
Kundrapu Supriya²

Abstract

The House Price Index (HPI) is one of the effective indicators for tracking frequent changes in housing prices. Various methodologies are used to process data on current house prices, which are affected by factors such as house configuration, building class, and air conditioning quality. A growing number of research papers adopt classical machine learning approaches to estimate house sale prices accurately, yet they give little attention to the data pre-processing techniques that make the data suitable for building more accurate house price forecasting models. This research reviews a wide variety of data pre-processing techniques. It covers missingness mechanisms, missing data handling, categorical feature encoding, discretization, outlier handling, and feature scaling in depth, with the aim of building efficient predictive models. Comprehensive arguments are presented on the advantages and disadvantages of prevailing data pre-processing techniques under various distribution scenarios of the variables in the house price data. The conclusions of this research support the evolution of modern data-driven research in machine learning.
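The pre-processing stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the specific choices (median imputation, one-hot encoding, equal-width binning, the 1.5 × IQR rule, min-max scaling), the function names, and the toy columns `area` and `quality` are all assumptions made for illustration, and only the Python standard library is used.

```python
from statistics import median, quantiles


def impute_median(values):
    """Missing data handling: replace None with the median of observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def one_hot(labels):
    """Categorical encoding: one 0/1 column per distinct category (sorted)."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


def equal_width_bins(values, n_bins):
    """Discretization: map each value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def iqr_outliers(values, k=1.5):
    """Outlier detection: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def min_max_scale(values):
    """Feature scaling: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy columns, not taken from the paper's dataset.
area = [1200, None, 1500, 1800, 9000]
quality = ["good", "fair", "good", "excellent", "fair"]

area_filled = impute_median(area)
quality_encoded = one_hot(quality)
area_binned = equal_width_bins(area_filled, n_bins=3)
area_flags = iqr_outliers(area_filled)
area_scaled = min_max_scale(area_filled)

print(area_filled, quality_encoded, area_binned, area_flags, area_scaled, sep="\n")
```

In this toy column the single extreme value dominates both the equal-width bins and the min-max range, which is the kind of distribution-dependent trade-off the abstract says the review examines.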

Author information

Authors and Affiliations

Data Science and Artificial Intelligence, International Institute of Information Technology Naya Raipur, Naya Raipur, India
K Mallikharjuna Rao & Ghanta Saikrishna
Computer Science and Engineering, International Institute of Information Technology Naya Raipur, Naya Raipur, India
Kundrapu Supriya

Contributions

The authors reviewed and summarized comprehensive arguments to select effective data processing mechanisms for the house price analysis dataset.

Corresponding authors

Correspondence to K Mallikharjuna Rao or Ghanta Saikrishna.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mallikharjuna Rao, K., Saikrishna, G. & Supriya, K. Data preprocessing techniques: emergence and selection towards machine learning models - a practical review using HPA dataset. Multimed Tools Appl 82, 37177–37196 (2023). https://doi.org/10.1007/s11042-023-15087-5

Received: 03 February 2022
Revised: 08 February 2023
Accepted: 02 March 2023
Published: 18 March 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s11042-023-15087-5
