
Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Data quality issues have attracted widespread attention because dirty data degrade data mining and machine learning results. The relationship between data quality and result accuracy can guide both the selection of an appropriate model for data of a given quality and the decision of what share of the data to clean. However, little research has explored this relationship. Motivated by this, this paper experimentally compares the effects of missing, inconsistent, and conflicting data on classification and clustering models. The experimental results show that the impact of dirty data depends on the error type, the error rate, and the data size. Based on these findings, we suggest that users apply our two proposed metrics, sensibility and data quality inflection point, to model selection and data cleaning.
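
The kind of experiment the abstract describes, measuring model accuracy as the error rate grows, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's protocol: the synthetic two-class data, the nearest-centroid classifier, mean imputation, and the helper names (`make_data`, `inject_missing`, `accuracy_at_rate`) are all hypothetical, and only missing values are simulated. The paper's sensibility and data quality inflection point metrics are not defined in the abstract, so they are not computed here.

```python
import random
from statistics import mean


def make_data(n, rng):
    """Two Gaussian classes in 2-D (synthetic, hypothetical data)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 3.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def inject_missing(rows, rate, rng):
    """Replace each feature value with None with probability `rate`."""
    return [([None if rng.random() < rate else v for v in x], y) for x, y in rows]


def column_means(rows):
    """Mean of the observed values per column (0.0 if a column is all missing)."""
    means = []
    for d in range(len(rows[0][0])):
        observed = [x[d] for x, _ in rows if x[d] is not None]
        means.append(mean(observed) if observed else 0.0)
    return means


def impute(rows, means):
    """Fill every None with the given column mean."""
    return [([means[d] if v is None else v for d, v in enumerate(x)], y)
            for x, y in rows]


def fit_centroids(rows):
    """Nearest-centroid classifier: one mean vector per class label."""
    centroids = {}
    for label in sorted({y for _, y in rows}):
        pts = [x for x, y in rows if y == label]
        centroids[label] = [mean(p[d] for p in pts) for d in range(len(pts[0]))]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def accuracy_at_rate(rate, seed=0, n_train=400, n_test=400):
    """Accuracy when train and test both have values missing at `rate`."""
    rng = random.Random(seed)
    train = inject_missing(make_data(n_train, rng), rate, rng)
    test = inject_missing(make_data(n_test, rng), rate, rng)
    means = column_means(train)  # imputation statistics come from training data only
    model = fit_centroids(impute(train, means))
    correct = sum(predict(model, x) == y for x, y in impute(test, means))
    return correct / n_test


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.3, 0.5):
        print(f"missing rate {rate:.1f}: accuracy {accuracy_at_rate(rate):.3f}")
```

Repeating the loop with other error types, classifiers, or data sizes would give the kind of accuracy-versus-error-rate curves from which model comparisons could be drawn, though the exact setup used in the paper may differ.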



Author information


Corresponding author

Correspondence to Hong-Zhi Wang.

Supplementary Information

ESM 1

(PDF 161 kb)

About this article

Cite this article

Qi, ZX., Wang, HZ. & Wang, AJ. Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation. J. Comput. Sci. Technol. 36, 806–821 (2021). https://doi.org/10.1007/s11390-021-1344-6


  • DOI: https://doi.org/10.1007/s11390-021-1344-6
