A New Metadata Model to Uniformly Handle Heterogeneous Data Lake Sources

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 909)

Abstract

Metadata have always played a key role in enabling heterogeneous data sources to cooperate. This role has become even more crucial with the advent of data lakes, where metadata are the only means of guaranteeing effective and efficient management of data source interoperability. For this reason, defining new models and paradigms for metadata representation and management is crucial in the data lake scenario. In this paper, we address this issue by proposing a new metadata model well suited to data lakes. Furthermore, to illustrate its capabilities, we present an approach that leverages it to “structure” unstructured sources and to extract thematic views from heterogeneous data lake sources.

Notes

  1. http://wiki.dbpedia.org.

  2. In this paper, we use the term “lemma” according to the meaning it has in BabelNet [17]. Here, given a term, its lemmas are other objects (terms, emoticons, etc.) contributing to specify its meaning.

  3. Please note that Phases 3 and 4 could be merged into a single phase, which would avoid defining arcs with the label “similarTo”. Here, we keep these arcs and both phases to preserve the information about similarity between nodes for future use.

  4. Whenever this does not happen, the mapping can be automatically provided by the DBpedia Lookup Service (http://wiki.dbpedia.org/projects/dbpedia-lookup).

  5. Here, two nodes are equal if the corresponding names coincide.

  6. In this figure, we do not show the arc labels for the sources C, W and E because all of them are “contains” and their presence would have complicated the layout unnecessarily.

  7. Hereafter, we use the notation S.o to indicate the object o of the source S.

  8. In this figure, for layout reasons, we do not show the arc labels because they are the same as the corresponding arcs of Fig. 2.

  9. Prefixes dbo and dbr stand for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/.
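
The conventions stated in Notes 5, 7 and 9 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Node` class, the `expand` helper and the sample names (`city`, `Rome`) are hypothetical. Only the two prefix IRIs, the S.o notation, the source labels C and W, and name-based node equality are taken from the notes.

```python
from dataclasses import dataclass, field

# Prefix table taken from Note 9.
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'dbr:Rome' into a full IRI.

    Hypothetical helper; raises ValueError for an unknown or missing prefix.
    """
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


@dataclass(frozen=True)
class Node:
    """Hypothetical node type: equality depends on the name only (Note 5)."""

    name: str
    # The source is excluded from comparison and hashing.
    source: str = field(compare=False)

    def qualified(self) -> str:
        """Render the node in the S.o notation of Note 7."""
        return f"{self.source}.{self.name}"


if __name__ == "__main__":
    a = Node("city", "C")
    b = Node("city", "W")
    print(a.qualified(), b.qualified(), a == b)
    print(expand("dbr:Rome"))
```

Excluding `source` from the comparison is one way to make two nodes from different sources count as equal whenever their names coincide, while still allowing each node to be printed in its source-qualified form.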

References

  1. Abiteboul, S., Duschka, O.M.: Complexity of answering queries using materialized views. In: Proceedings of the International Symposium on Principles of Database Systems (SIGMOD/PODS 1998), Seattle, WA, USA, 1998, pp. 254–263. ACM (1998)

  2. Aversano, L., Intonti, R., Quattrocchi, C., Tortorella, M.: Building a virtual view of heterogeneous data source views. In: Proceedings of the International Conference on Software and Data Technologies (ICSOFT 2010), Athens, Greece, 2010, pp. 266–275. INSTICC Press (2010)

  3. Bergamaschi, S., Castano, S., Vincini, M., Beneventano, D.: Semantic integration and query of heterogeneous information sources. Data Knowl. Eng. 36(3), 215–249 (2001)

  4. Bilalli, B., Abelló, A., Aluja-Banet, T., Wrembel, R.: Towards intelligent data analysis: the metadata challenge. In: Proceedings of the International Conference on Internet of Things and Big Data (IoTBD 2016), Roma, Italy, 2016, pp. 331–338 (2016)

  5. Biskup, J., Embley, D.: Extracting information from heterogeneous information sources using ontologically specified target views. Inf. Syst. 28(3), 169–212 (2003). Elsevier

  6. De Meo, P., Quattrone, G., Terracina, G., Ursino, D.: Integration of XML schemas at various "severity" levels. Inf. Syst. 31(6), 397–434 (2006)

  7. Fan, W., Wang, X., Wu, Y.: Answering pattern queries using views. IEEE Trans. Knowl. Data Eng. 28(2), 326–341 (2016). IEEE

  8. Fang, H.: Managing data lakes in big data era: what’s a data lake and why has it became popular in data management ecosystem. In: Proceedings of the International Conference on Cyber Technology in Automation (CYBER 2015), Shenyang, China, 2015, pp. 820–824. IEEE (2015)

  9. Farid, M., Roatis, A., Ilyas, I.F., Hoffmann, H., Chu, X.: CLAMS: bringing quality to Data Lakes. In: Proceedings of the International Conference on Management of Data (SIGMOD/PODS 2016), San Francisco, CA, USA, 2016, pp. 2089–2092. ACM (2016)

  10. Hai, R., Geisler, S., Quix, C.: Constance: an intelligent data lake system. In: Proceedings of the International Conference on Management of Data (SIGMOD/PODS 2016), San Francisco, CA, USA, 2016, pp. 2097–2100. ACM (2016)

  11. Halevy, A.: Answering queries using views: a survey. VLDB J. 10(4), 270–294 (2001). Springer

  12. Hitzler, P., Janowicz, K.: Linked data, big data, and the 4th paradigm. Semant. Web 4(3), 233–235 (2013)

  13. Dublin Core Metadata Initiative. DCMI metadata terms. Technical report (2012)

  14. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets with the VoID vocabulary. Technical report (2011)

  15. Kondrak, G.: N-gram similarity and distance. In: Consens, M., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 115–126. Springer, Heidelberg (2005). https://doi.org/10.1007/11575832_13

  16. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with Cupid. In: Proceedings of the International Conference on Very Large Data Bases (VLDB 2001), Rome, Italy, 2001, pp. 49–58. Morgan Kaufmann (2001)

  17. Navigli, R., Ponzetto, S.P.: BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, 217–250 (2012). Elsevier

  18. Oram, A.: Managing the Data Lake. O’Reilly, Sebastopol (2015)

  19. Palopoli, L., Pontieri, L., Terracina, G., Ursino, D.: Intensional and extensional integration and abstraction of heterogeneous databases. Data Knowl. Eng. 35(3), 201–237 (2000)

  20. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)

  21. Singh, K., Singh, V.: Answering graph pattern query using incremental views. In: Proceedings of the International Conference on Computing (ICCCA 2016), Greater Noida, India, 2016, pp. 54–59. IEEE (2016)

  22. Wang, J., Li, J., Yu, J.X.: Answering tree pattern queries using views: a revisit. In: Proceedings of the International Conference on Extending Database Technology (EDBT/ICDT 2011), Uppsala, Sweden, 2011, pp. 153–164. ACM (2011)

Author information

Corresponding author

Correspondence to Domenico Ursino.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Diamantini, C., Giudice, P.L., Musarella, L., Potena, D., Storti, E., Ursino, D. (2018). A New Metadata Model to Uniformly Handle Heterogeneous Data Lake Sources. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_17

  • DOI: https://doi.org/10.1007/978-3-030-00063-9_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00062-2

  • Online ISBN: 978-3-030-00063-9

  • eBook Packages: Computer Science, Computer Science (R0)
