Data preprocessing techniques: emergence and selection towards machine learning models - a practical review using HPA dataset

Multimedia Tools and Applications

Abstract

The House Price Index (HPI) is one of the effective indicators for tracking the frequent changes in housing prices. Current house prices are affected by factors such as house configuration, building class, and air conditioning quality, and various methodologies are used to process this data. A growing number of research papers adopt classical machine learning approaches to estimate house sale prices accurately, yet they barely consider the data processing techniques that make the data suitable for building more accurate house price forecasting models. This research reviews a wide variety of data pre-processing techniques. It covers missingness of data, missing data handling, categorical feature encoding, discretization, outlier handling, and feature scaling in depth, with the aim of building efficient predictive models. It presents comprehensive arguments on the advantages and disadvantages of prevailing data pre-processing techniques under various distribution scenarios of the variables in the house price data. The conclusions are intended to support the evolution of modern data-driven research in machine learning.
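As a rough illustration of the kinds of steps the abstract lists (missing data handling, outlier detection, feature scaling, categorical encoding, and discretization), the following is a minimal Python sketch. The column names, toy values, and helper functions are hypothetical and are not taken from the paper or its dataset. Median imputation, the IQR rule, min-max scaling, one-hot encoding, and equal-width binning are each just one common choice for the corresponding step, not necessarily the technique the authors recommend.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode labels as 0/1 indicator rows, one column per category."""
    categories = sorted(set(labels))
    rows = [[int(label == c) for c in categories] for label in labels]
    return categories, rows


def equal_width_bins(values, n_bins):
    """Assign each value a 0-based index among n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical toy columns (not from the HPA dataset).
area = [50.0, 60.0, None, 80.0, 70.0, 400.0]
building_class = ["A", "B", "A", "C", "B", "A"]

area_filled = impute_median(area)              # missing data handling
outliers = iqr_outlier_flags(area_filled)      # outlier detection
# For brevity only the numeric column is filtered here; a real pipeline
# would drop or cap the whole row so that all columns stay aligned.
area_clean = [v for v, bad in zip(area_filled, outliers) if not bad]
area_scaled = min_max_scale(area_clean)        # feature scaling
area_bins = equal_width_bins(area_clean, 3)    # discretization
categories, encoded = one_hot(building_class)  # categorical encoding
```

The order of the steps matters in this sketch: outliers are flagged before scaling because min-max scaling is sensitive to extreme values, which would otherwise compress the remaining observations into a narrow range.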

Author information

Contributions

The authors reviewed and summarized comprehensive arguments to select effective data processing mechanisms for the house price analysis dataset.

Corresponding authors

Correspondence to K Mallikharjuna Rao or Ghanta Saikrishna.

About this article

Cite this article

Mallikharjuna Rao, K., Saikrishna, G. & Supriya, K. Data preprocessing techniques: emergence and selection towards machine learning models - a practical review using HPA dataset. Multimed Tools Appl 82, 37177–37196 (2023). https://doi.org/10.1007/s11042-023-15087-5

  • DOI: https://doi.org/10.1007/s11042-023-15087-5
