
Metadata Discovery Using Data Sampling and Exploratory Data Analysis

  • Conference paper
Model and Data Engineering (MEDI 2019)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11815)


Abstract

Metadata discovery contributes substantially to understanding the semantics of data, the relationships between data, and fundamental data features, for the purposes of data management, query processing, and data integration. It continues to evolve with the help of data profiling and manual annotation, which has produced a variety of good-quality data profiling techniques and tools. Although distinct metadata standards exist for individual fields such as finance, biology, experimental physics, and medicine, there is no generic method that discovers metadata automatically or presents them in a unified way. In this paper, we present a technique for discovering and generating metadata for data sources that do not provide explicit metadata. To this end, we apply exploratory data analysis to produce two kinds of metadata, administrative and technical, in order to find similarities between resources with respect to their structures and contents. Our technique was evaluated experimentally. The results show that it can identify similar data sources and compute their similarity measures.
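As an illustration only, the following Python sketch shows one way the general idea of the abstract could look in practice: sample rows from a source that ships no metadata, derive simple technical metadata per column, and compare two sources by the overlap of their column signatures. It is not the authors' implementation. The function names (`profile`, `schema_similarity`), the default sample size, the numeric/text type heuristic, and the use of Jaccard similarity are all assumptions made for this example, and administrative metadata is not covered.

```python
import csv
import io
import random


def sample_rows(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random (fixed seed for repeatability)."""
    rows = list(rows)
    if len(rows) <= k:
        return rows
    return random.Random(seed).sample(rows, k)


def infer_type(values):
    """Hypothetical heuristic: 'numeric' if every non-empty value parses as float."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "text"
    return "numeric"


def profile(csv_text, k=100):
    """Derive per-column technical metadata from a sample of a CSV source."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sample_rows(reader, k)
    meta = {}
    for col in reader.fieldnames or []:
        # Missing cells come back as None from DictReader; treat them as empty.
        values = [(row.get(col) or "") for row in rows]
        meta[col] = {
            "type": infer_type(values),
            "distinct": len(set(values)),
            "null_ratio": (sum(v == "" for v in values) / len(values)) if values else 0.0,
        }
    return meta


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def schema_similarity(meta_a, meta_b):
    """Compare two profiles by their (lower-cased column name, inferred type) pairs."""
    sig_a = {(name.lower(), m["type"]) for name, m in meta_a.items()}
    sig_b = {(name.lower(), m["type"]) for name, m in meta_b.items()}
    return jaccard(sig_a, sig_b)


if __name__ == "__main__":
    # Two tiny made-up sources, used only to exercise the functions.
    source_a = "id,speed,road\n1,30,A1\n2,50,B2\n"
    source_b = "ID,speed,weather\n1,30,rain\n2,,fog\n"
    print(schema_similarity(profile(source_a), profile(source_b)))
```

Comparing name and type pairs captures only structural similarity. A content-based comparison, which the abstract also mentions, would need additional per-column statistics such as value distributions.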



Acknowledgements

The work of Hiba Khalid is supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

The work of Robert Wrembel is supported from the grant of the Polish National Agency for Academic Exchange, within the Bekker programme.

Author information

Corresponding author

Correspondence to Hiba Khalid.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Khalid, H., Wrembel, R., Zimányi, E. (2019). Metadata Discovery Using Data Sampling and Exploratory Data Analysis. In: Schewe, K.-D., Singh, N. (eds) Model and Data Engineering. MEDI 2019. Lecture Notes in Computer Science, vol 11815. Springer, Cham. https://doi.org/10.1007/978-3-030-32065-2_8

  • DOI: https://doi.org/10.1007/978-3-030-32065-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32064-5

  • Online ISBN: 978-3-030-32065-2

  • eBook Packages: Computer Science, Computer Science (R0)
