Abstract
Metadata discovery contributes substantially to understanding the semantics of data, the relationships between data, and fundamental data features, for the purposes of data management, query processing, and data integration. It continues to evolve with the help of data profiling and manual annotators, which has resulted in a variety of good-quality data profiling techniques and tools. Although different metadata standards have been specified for distinct fields such as finance, biology, experimental physics, and medicine, there is no generic method that discovers metadata automatically or presents them in a unified way. In this paper, we present a technique for discovering and generating metadata for data sources that do not provide explicit metadata. To this end, we apply exploratory data analysis to produce two kinds of metadata, namely administrative and technical, in order to find similarities between resources with respect to their structures and contents. We evaluated the technique experimentally. The results show that it makes it possible to identify similar data sources and to compute their similarity measures.
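As a rough illustration of the idea of deriving technical metadata from a sample and comparing sources by structure, the following minimal sketch profiles two small CSV sources and scores their overlap. It is not the technique evaluated in the paper: the function names (`infer_type`, `profile`, `structural_similarity`), the coarse type categories, the sample data, and the choice of Jaccard similarity over (column name, type) pairs are all assumptions made for this example.

```python
import csv
import io


def infer_type(values):
    """Guess a coarse column type from sampled string values (hypothetical helper)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def profile(csv_text, sample_size=100):
    """Build technical metadata (column name -> inferred type) from the first rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Read at most sample_size rows instead of scanning the whole source.
    rows = [row for _, row in zip(range(sample_size), reader)]
    return {
        col: infer_type([(r[col] or "") for r in rows])
        for col in (reader.fieldnames or [])
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def structural_similarity(meta_a, meta_b):
    """Compare two profiles on their (column name, inferred type) pairs."""
    return jaccard(set(meta_a.items()), set(meta_b.items()))


# Made-up sample data, used only to exercise the sketch.
source_a = "accident_id,year,severity\n1,2005,slight\n2,2006,serious\n"
source_b = "accident_id,year,speed_limit\n7,2010,30\n8,2011,60\n"

meta_a = profile(source_a)
meta_b = profile(source_b)
similarity = structural_similarity(meta_a, meta_b)
print(meta_a, meta_b, similarity)
```

Here the two sources share two of four distinct (column, type) pairs, so the score is 0.5. A content-level comparison, for instance over value distributions, would need additional statistics per column and is left out of this sketch.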
Acknowledgements
The work of Hiba Khalid is supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).
The work of Robert Wrembel is supported by a grant from the Polish National Agency for Academic Exchange, within the Bekker programme.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Khalid, H., Wrembel, R., Zimányi, E. (2019). Metadata Discovery Using Data Sampling and Exploratory Data Analysis. In: Schewe, KD., Singh, N. (eds) Model and Data Engineering. MEDI 2019. Lecture Notes in Computer Science, vol 11815. Springer, Cham. https://doi.org/10.1007/978-3-030-32065-2_8
DOI: https://doi.org/10.1007/978-3-030-32065-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32064-5
Online ISBN: 978-3-030-32065-2
eBook Packages: Computer Science (R0)