The Impact of Data Preprocessing on Prediction Effectiveness

Kiersztyn, Adam; Kiersztyn, Krystyna

doi:10.1007/978-3-031-23492-7_30

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 13588))

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

702 Accesses

Abstract

This study considers a very important issue, which is the impact of preprocessing on model performance. On the example of data describing taxicab trips in New York City, a model predicting the average speed of a trip was built. The effectiveness of the obtained model was examined using relative error. The results were compared with the models obtained after prior data cleaning from the records containing missing data. Additionally, the effect of removing outliers on model quality was examined. An integral part of the paper is the description of a new method of anomaly detection. The author’s method involves fuzzy classification of the declared distance into three classes. As an indicator to allow for classification, the percentage of redundant distance with respect to Manhattan distance was selected. The results of a wide range of numerical experiments confirm the necessity of preprocessing. Comparison of a number of competing anomaly detection and prediction model building methods allows for reasonable generalization of the obtained conclusions. Additionally, the skillful use of fuzzy sets for anomaly detection allowed the development of a method that can be generalized to other transportation issues.

The work was co-financed by the Lublin University of Technology Scientific Fund: FD-20/IT-3/002.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 69.99; Price excludes VAT (USA)

Softcover Book: USD 89.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Designing and Developing a Model for Detecting Unusual Condition in Urban Street Network

Article 19 December 2024

Fuzzy-statistical prediction intervals from crisp regression models

Article 25 April 2019

Outlier Detection in Fuzzy Regressions

References

Aitkin, M., Wilson, G.T.: Mixture models, outliers, and the EM algorithm. Technometrics 22(3), 325–331 (1980)
Article MATH Google Scholar
Alasadi, S.A., Bhaya, W.S.: Review of data preprocessing techniques in data mining. J. Eng. Appl. Sci. 12(16), 4102–4107 (2017)
Google Scholar
Arabameri, A., Pradhan, B., Rezaei, K., Sohrabi, M., Kalantari, Z.: Gis-based landslide susceptibility mapping using numerical risk factor bivariate model and its ensemble with linear multivariate regression and boosted regression tree algorithms. J. Mt. Sci. 16(3), 595–618 (2019)
Article Google Scholar
Berthold, M.R.: Mixed fuzzy rule formation. Int. J. Approx. Reason. 32(2–3), 67–84 (2003)
Article MATH Google Scholar
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104 (2000). https://doi.org/10.1145/342009.335388
Coppersmith, D., Hong, S.J., Hosking, J.R.: Partitioning nominal attributes in decision trees. Data Min. Knowl. Discov. 3(2), 197–217 (1999)
Article Google Scholar
Donovan, B., Work, D.: New York city taxi trip data (2010–2013) (2014). https://doi.org/10.13012/J8PN93H8
Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367–378 (2002)
Article MathSciNet MATH Google Scholar
Kamiran, F., Calders, T.: Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2012)
Article Google Scholar
Karczmarek, P., Kiersztyn, A., Pedrycz, W.: Fuzzy set-based isolation forest. In: 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–6. IEEE (2020)
Google Scholar
Karczmarek, P., Kiersztyn, A., Pedrycz, W., Al, E.: K-means-based isolation forest. Knowl.-Based Syst. 195, 105659 (2020)
Google Scholar
Karczmarek, P., Kiersztyn, A., Pedrycz, W., Czerwiński, D.: Fuzzy c-means-based isolation forest. Appl. Soft Comput. 106, 107354 (2021)
Article Google Scholar
Kiersztyn, A., Karczmarek, P., Kiersztyn, K., Pedrycz, W.: The concept of detecting and classifying anomalies in large data sets on a basis of information granules. In: 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–7. IEEE (2020)
Google Scholar
Kiersztyn, A., Karczmarek, P., Kiersztyn, K., Pedrycz, W.: Detection and classification of anomalies in large data sets on the basis of information granules. IEEE Trans. Fuzzy Syst. 30(8), 2850–2860 (2021)
Article Google Scholar
Kiersztyn, A., et al.: Data imputation in related time series using fuzzy set-based techniques. In: 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8. IEEE (2020)
Google Scholar
Kiersztyn, A., et al.: A comprehensive analysis of the impact of selecting the training set elements on the correctness of classification for highly variable ecological data. In: 2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–6. IEEE (2021)
Google Scholar
Kiersztyn, K.: Intuitively adaptable outlier detector. Stat. Anal. Data Min.: ASAData Sci. J. 15(4), 463–479 (2021)
Google Scholar
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data 6(1) (2012). https://doi.org/10.1145/2133360.2133363
Łopucki, R., Kiersztyn, A., Pitucha, G., Kitowski, I.: Handling missing data in ecological studies: ignoring gaps in the dataset can distort the inference. Ecol. Modell. 468, 109964 (2022)
Article Google Scholar
Osman, M.S., Abu-Mahfouz, A.M., Page, P.R.: A survey on data imputation techniques: water distribution system as a use case. IEEE Access 6, 63279–63291 (2018)
Article Google Scholar
Piironen, J., Vehtari, A.: Comparison of Bayesian predictive methods for model selection. Stat. Comput. 27(3), 711–735 (2017)
Article MathSciNet MATH Google Scholar
Priyanka, K.D.: Decision tree classifier: a detailed survey. Int. J. Inf. Decis. Sci. 12(3), 246–269 (2020)
Google Scholar
Raval, K.M.: Data mining techniques. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 2(10) (2012)
Google Scholar
Rousseeuw, P.J., Driessen, K.V.: A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3), 212–223 (1999). https://doi.org/10.1080/00401706.1999.10485670
Article Google Scholar
Vijayarani, S., Ilamathi, M.J., Nithya, M., et al.: Preprocessing techniques for text mining-an overview. Int. J. Comput. Sci. Commun. Netw. 5(1), 7–16 (2015)
Google Scholar
Wang, H., Bah, M.J., Hammad, M.: Progress in outlier detection techniques: a survey. IEEE Access 7, 107964–108000 (2019)
Article Google Scholar
Wu, C., Chau, K.W., Fan, C.: Prediction of rainfall time series using modular artificial neural networks coupled with data-preprocessing techniques. J. Hydrol. 389(1–2), 146–167 (2010)
Article Google Scholar
Zhang, Z.: Missing data imputation: focusing on single imputation. Ann. Transl. Med. 4(1) (2016)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, Lublin University of Technology, Lublin, Poland
Adam Kiersztyn
Department of Mathematical Modelling, The John Paul II Catholic University of Lublin, Lublin, Poland
Krystyna Kiersztyn

Authors

Adam Kiersztyn
View author publications
You can also search for this author in PubMed Google Scholar
Krystyna Kiersztyn
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Adam Kiersztyn .

Editor information

Editors and Affiliations

Systems Research Institute of the Polish Academy of Sciences, Warsaw, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kiersztyn, A., Kiersztyn, K. (2023). The Impact of Data Preprocessing on Prediction Effectiveness. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2022. Lecture Notes in Computer Science(), vol 13588. Springer, Cham. https://doi.org/10.1007/978-3-031-23492-7_30

Download citation

DOI: https://doi.org/10.1007/978-3-031-23492-7_30
Published: 24 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23491-0
Online ISBN: 978-3-031-23492-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

The Impact of Data Preprocessing on Prediction Effectiveness