Abstract
The House Price Index (HPI) is one of the effective indicators for tracking frequent changes in housing prices. Various methodologies are used to process current house price data, which are affected by factors such as house configuration, building class, and air conditioning quality. A growing number of research papers adopt classical machine learning approaches to estimate house sale prices accurately, yet they rarely consider the data processing techniques that make the data suitable for building more accurate house price forecasting models. This research reviews a wide variety of suitable data pre-processing techniques. It extensively covers missingness mechanisms, missing data handling, categorical feature encoding, discretization, outlier handling, and feature scaling as means of building efficient predictive models. Comprehensive arguments are presented on the advantages and disadvantages of prevailing data pre-processing techniques under various distribution scenarios of the variables in the house price data. The conclusions of this research support the evolution of modern data-driven research in machine learning.
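As a rough illustration of the pre-processing families named in the abstract, the following minimal Python sketch shows one simple variant of each: median imputation, IQR-based outlier capping, min-max normalization, one-hot encoding, and equal-width discretization. It is not the authors' implementation. All function names are hypothetical, and the 1.5 × IQR fence is a common convention assumed here rather than a value taken from the paper.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def cap_iqr(values, k=1.5):
    """Clip values to the fences [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted set of categories."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

In practice these steps would be chained per feature, for example imputing a numeric column first, then capping outliers, then scaling. The appropriate order and choice of technique depend on the distribution of each variable, which is the trade-off the review discusses.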
Author information
Contributions
The authors reviewed and summarized comprehensive arguments to select effective data processing mechanisms for the house price analysis dataset.
About this article
Cite this article
Mallikharjuna Rao, K., Saikrishna, G. & Supriya, K. Data preprocessing techniques: emergence and selection towards machine learning models - a practical review using HPA dataset. Multimed Tools Appl 82, 37177–37196 (2023). https://doi.org/10.1007/s11042-023-15087-5
DOI: https://doi.org/10.1007/s11042-023-15087-5
Keywords
- MCAR
- MAR
- MNAR
- Mean and median imputation
- KNN imputation
- Arbitrary imputation
- End of tail imputation
- One-hot encoding
- Ordinal encoding
- Mean encoding
- Label encoding
- KNN discretization
- Decision tree discretization
- Equal width and frequency discretization
- IQR outlier handling
- Grubbs test
- DBSCAN clustering
- Isolation Forest
- Normalization and standardization