Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

  • Conference paper
  • Big Data, Cloud and Applications (BDCA 2018)

Abstract

Data quality means that data are correct, reliable, accurate, and valid for their intended purpose in a given context. It is crucial for sound decision-making and reporting in every organization. However, the huge volumes of data that organizations produce, and the integration of redundant and heterogeneous data, make manual data quality control difficult. Intelligent technologies such as machine learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them into the same cluster in order to correct the data and thereby ensure data quality. Our approach is validated on the quality of taxpayers' financial data, using scikit-learn, the machine learning library for the Python programming language.
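
The idea of grouping similar names into one cluster and then correcting them can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the similarity measure or the clustering algorithm, so the choices below are assumptions made for illustration. The sketch uses `difflib.SequenceMatcher` from the Python standard library as the string similarity, scikit-learn's `DBSCAN` on a precomputed distance matrix as the clustering step (picked only because it is deterministic), a hypothetical threshold of 0.25, and made-up company names rather than real taxpayer data.

```python
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical name variants; illustrative only, not data from the paper.
names = [
    "ATLAS TRADING SARL",
    "ATLAS TRADNG SARL",
    "ATLAS TRADING S.A.R.L",
    "BENALI CONSULTING",
    "BEN ALI CONSULTING",
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


# Build a symmetric distance matrix: distance = 1 - similarity.
n = len(names)
distance = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - similarity(names[i], names[j])
        distance[i, j] = distance[j, i] = d

# eps=0.25 is an assumed threshold. With min_samples=1 every name is a core
# point, so clusters are chains of names whose distance is at most eps.
labels = DBSCAN(eps=0.25, min_samples=1, metric="precomputed").fit_predict(distance)

# Correction step (assumed rule): replace each name with the first name
# encountered in its cluster.
canonical = {}
for name, label in zip(names, labels):
    canonical.setdefault(int(label), name)
corrected = [canonical[int(label)] for label in labels]

for original, fixed in zip(names, corrected):
    print(f"{original!r} -> {fixed!r}")
```

In practice the threshold and the rule for choosing a canonical name (first seen, most frequent, or a value from a trusted register) would need tuning against labeled examples; the values here are placeholders.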

Corresponding author

Correspondence to Hanae Necba.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Necba, H., Rhanoui, M., El Asri, B. (2018). Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration. In: Tabii, Y., Lazaar, M., Al Achhab, M., Enneya, N. (eds) Big Data, Cloud and Applications. BDCA 2018. Communications in Computer and Information Science, vol 872. Springer, Cham. https://doi.org/10.1007/978-3-319-96292-4_16

  • DOI: https://doi.org/10.1007/978-3-319-96292-4_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96291-7

  • Online ISBN: 978-3-319-96292-4

  • eBook Packages: Computer Science, Computer Science (R0)