Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Necba, Hanae; Rhanoui, Maryem; El Asri, Bouchra

doi:10.1007/978-3-319-96292-4_16

Part of the book series: Communications in Computer and Information Science (CCIS, volume 872)

Included in the following conference series:

International Conference on Big Data, Cloud and Applications


Abstract

Data quality means that data are correct, reliable, accurate, and valid for use, and that they serve their purpose in a given context. Data quality is crucial for making sound decisions and producing reliable reports in any organization. However, the huge volume of data that organizations produce, together with the integration of redundant and heterogeneous data, makes manual data quality control difficult. Intelligent technologies such as machine learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them in the same cluster so that the data can be corrected, thereby ensuring data quality. Our approach is validated in the context of the financial data quality of taxpayers, using scikit-learn, a machine learning library for the Python programming language.
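
The general idea of grouping similar names without labels can be illustrated with a minimal sketch. The abstract names neither the similarity measure nor the clustering algorithm, so everything below is an assumption made for illustration: the sample names are hypothetical, the similarity is the standard-library `difflib.SequenceMatcher` ratio, and the clustering is scikit-learn's `DBSCAN` on a precomputed distance matrix with `min_samples=1`, so that names within `eps` of each other are chained into one group. It is not the authors' method.

```python
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import DBSCAN


def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def distance_matrix(names):
    """Symmetric matrix of 1 - similarity ratio for every pair of names."""
    n = len(names)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            ratio = SequenceMatcher(None, names[i], names[j]).ratio()
            dist[i, j] = dist[j, i] = 1.0 - ratio
    return dist


def cluster_names(names, eps=0.2):
    """Group names whose normalized spellings are within `eps` of each other.

    `eps` is an illustrative threshold (an assumption), not a value from the paper.
    """
    cleaned = [normalize(name) for name in names]
    model = DBSCAN(eps=eps, min_samples=1, metric="precomputed")
    labels = model.fit_predict(distance_matrix(cleaned))
    clusters = {}
    for name, label in zip(names, labels):
        clusters.setdefault(int(label), []).append(name)
    return list(clusters.values())


# Hypothetical taxpayer names with spelling and spacing variants.
sample = [
    "Atlas Trading SARL",
    "atlas  tradng sarl",
    "Rabat Textile",
    "Rabat Textiles",
    "Sahara Transport",
]
for group in cluster_names(sample):
    print(group)
```

On this sample, the two "Atlas" spellings fall into one group and the two "Rabat" spellings into another, while "Sahara Transport" stays on its own. Correcting the data would then amount to replacing each variant with a representative name chosen for its group; how that representative is chosen is left open here.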



Author information

Authors and Affiliations

IMS Team, ADMIR Laboratory, Rabat IT Center, ENSIAS, Mohammed V University, Rabat, Morocco
Hanae Necba, Maryem Rhanoui & Bouchra El Asri
Meridian Team, LYRICA Laboratory, School of Information Sciences, Rabat, Morocco
Maryem Rhanoui


Corresponding author

Correspondence to Hanae Necba.

Editor information

Editors and Affiliations

Abdelmalek Essaâdi University, Tétouan, Morocco
Youness Tabii
Abdelmalek Essaâdi University, Tétouan, Morocco
Mohamed Lazaar
Abdelmalek Essaâdi University, Tétouan, Morocco
Mohammed Al Achhab
Université Ibn-Tofail, Tétouan, Morocco
Nourddine Enneya


About this paper

Cite this paper

Necba, H., Rhanoui, M., El Asri, B. (2018). Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration. In: Tabii, Y., Lazaar, M., Al Achhab, M., Enneya, N. (eds) Big Data, Cloud and Applications. BDCA 2018. Communications in Computer and Information Science, vol 872. Springer, Cham. https://doi.org/10.1007/978-3-319-96292-4_16


DOI: https://doi.org/10.1007/978-3-319-96292-4_16
Published: 14 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96291-7
Online ISBN: 978-3-319-96292-4
eBook Packages: Computer Science (R0)
