Metadata Discovery Using Data Sampling and Exploratory Data Analysis

Khalid, Hiba; Wrembel, Robert; Zimányi, Esteban

doi:10.1007/978-3-030-32065-2_8

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11815)

Included in the following conference series:

International Conference on Model and Data Engineering

Abstract

Metadata discovery contributes substantially to understanding the semantics of data, the relationships between data, and fundamental data features, for the purposes of data management, query processing, and data integration. Metadata discovery continues to evolve with the help of data profiling and manual annotators, which has produced a variety of good-quality data profiling techniques and tools. Even though different metadata standards are specified for distinct fields such as finance, biology, experimental physics, and medicine, there is no generic method that discovers metadata automatically or presents them in a unified way. In this paper, we present a technique for discovering and generating metadata for data sources that do not provide explicit metadata. To this end, we apply exploratory data analysis to produce two kinds of metadata, administrative and technical, in order to find similarities between resources with respect to their structures and contents. Our technique was evaluated experimentally. The results show that the technique makes it possible to identify similar data sources and compute their similarity measures.
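To make the idea of sampling-based metadata discovery more concrete, the following is a minimal Python sketch. It is not the technique evaluated in the paper, only an illustration under stated assumptions: sources are CSV text with a header row, technical metadata is reduced to a coarse inferred type, a distinct-value count, and a null ratio per column, and structural similarity is taken to be the Jaccard index over (column name, type) pairs. The function names and the sample column names are hypothetical.

```python
import csv
import io
import random


def infer_type(values):
    """Guess a coarse column type from sampled string values (assumed heuristic)."""
    def parses_as(cast, text):
        try:
            cast(text)
            return True
        except ValueError:
            return False

    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    if all(parses_as(int, v) for v in non_empty):
        return "integer"
    if all(parses_as(float, v) for v in non_empty):
        return "float"
    return "string"


def profile_source(csv_text, sample_size=100, seed=0):
    """Sample rows from a CSV source and derive simple per-column technical metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    # Profile a random sample rather than the full source when it is large.
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)

    profile = {}
    for column in columns:
        # Missing trailing fields come back as None; treat them as empty strings.
        values = [(row[column] or "") for row in sample]
        non_empty = [v for v in values if v != ""]
        profile[column.strip().lower()] = {
            "type": infer_type(values),
            "distinct": len(set(non_empty)),
            "null_ratio": (1 - len(non_empty) / len(values)) if values else 0.0,
        }
    return profile


def schema_similarity(profile_a, profile_b):
    """Jaccard index over (column name, inferred type) pairs of two profiles."""
    a = {(name, meta["type"]) for name, meta in profile_a.items()}
    b = {(name, meta["type"]) for name, meta in profile_b.items()}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up example data; the column names are illustrative only.
source_one = "accident_id,severity,speed_limit\n1,2,30\n2,3,60\n"
source_two = "accident_id,severity,road_type\n7,1,A\n8,2,B\n"
similarity = schema_similarity(profile_source(source_one), profile_source(source_two))
```

In this toy case the two sources share two of four distinct (name, type) pairs, so the structural similarity is 0.5. A content-based measure, for example one comparing value distributions of matching columns, would need additional statistics in the profile.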

Acknowledgements

The work of Hiba Khalid is supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

The work of Robert Wrembel is supported by a grant from the Polish National Agency for Academic Exchange, within the Bekker programme.

Author information

Authors and Affiliations

Université Libre de Bruxelles, Brussels, Belgium
Hiba Khalid & Esteban Zimányi
Poznan University of Technology, Poznań, Poland
Hiba Khalid & Robert Wrembel

Corresponding author

Correspondence to Hiba Khalid.

Editor information

Editors and Affiliations

UIUC Institute, Zhejiang University, Zhejiang, China
Klaus-Dieter Schewe
INPT-ENSEEIHT/IRIT, Toulouse, France
Neeraj Kumar Singh

About this paper

Cite this paper

Khalid, H., Wrembel, R., Zimányi, E. (2019). Metadata Discovery Using Data Sampling and Exploratory Data Analysis. In: Schewe, KD., Singh, N. (eds) Model and Data Engineering. MEDI 2019. Lecture Notes in Computer Science, vol 11815. Springer, Cham. https://doi.org/10.1007/978-3-030-32065-2_8

DOI: https://doi.org/10.1007/978-3-030-32065-2_8
Published: 21 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32064-5
Online ISBN: 978-3-030-32065-2
eBook Packages: Computer Science, Computer Science (R0)
