Abstract
Data quality means that data are correct, reliable, accurate, and valid for their intended purpose in a given context. It is crucial for sound decisions and reporting in every organization. However, the huge volume of data that organizations produce, together with the integration of redundant and heterogeneous sources, makes manual data quality control difficult; intelligent technologies such as machine learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them into the same cluster in order to correct the data and thereby ensure data quality. Our approach is validated on taxpayer financial data using scikit-learn, the machine learning library for the Python programming language.
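The abstract does not say which similarity measure or clustering algorithm is used, so the following is only a minimal sketch of the general idea (match similar names, group them, and replace each name with a representative of its group). It assumes a character-level similarity from Python's `difflib` and scikit-learn's `DBSCAN` on a precomputed distance matrix. The helper names (`normalise`, `distance_matrix`, `cluster_names`), the `eps` threshold, and the sample taxpayer names are all hypothetical and are not taken from the paper.

```python
import re
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import DBSCAN


def normalise(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (illustrative choice)."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def distance_matrix(names):
    """Pairwise distance = 1 - SequenceMatcher ratio on the normalised names."""
    norm = [normalise(n) for n in names]
    n = len(norm)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - SequenceMatcher(None, norm[i], norm[j]).ratio()
            dist[i, j] = dist[j, i] = d  # mirror so the matrix is symmetric
    return dist


def cluster_names(names, eps=0.2):
    """Group similar names and map each one to its cluster's medoid.

    eps is an assumed threshold: names closer than eps are linked.
    With min_samples=1 every name is assigned to a cluster (no noise label).
    """
    dist = distance_matrix(names)
    labels = DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit_predict(dist)
    canonical = {}
    for label in set(labels):
        members = np.flatnonzero(labels == label)
        sub = dist[np.ix_(members, members)]
        # Medoid: the member with the smallest total distance to the others.
        canonical[label] = names[members[sub.sum(axis=1).argmin()]]
    return labels, [canonical[label] for label in labels]


if __name__ == "__main__":
    # Hypothetical sample data, not from the paper's data set.
    sample = [
        "ACME Trading SARL",
        "ACME Tradng SARL",
        "Acme Trading S.A.R.L",
        "Atlas Logistics",
        "Atlas Logistic",
        "Blue Ocean Foods",
    ]
    labels, corrected = cluster_names(sample)
    for raw, label, fixed in zip(sample, labels, corrected):
        print(f"{label}\t{raw}\t->\t{fixed}")
```

In this sketch the misspelled or differently punctuated variants end up in the same cluster as the clean name, and the medoid serves as the corrected value. A real pipeline would need a threshold tuned on labelled examples and a similarity measure suited to the names in question.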
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Necba, H., Rhanoui, M., El Asri, B. (2018). Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration. In: Tabii, Y., Lazaar, M., Al Achhab, M., Enneya, N. (eds) Big Data, Cloud and Applications. BDCA 2018. Communications in Computer and Information Science, vol 872. Springer, Cham. https://doi.org/10.1007/978-3-319-96292-4_16
DOI: https://doi.org/10.1007/978-3-319-96292-4_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96291-7
Online ISBN: 978-3-319-96292-4
eBook Packages: Computer Science, Computer Science (R0)