An Approach to Extracting Topic-guided Views from the Sources of a Data Lake

Diamantini, Claudia; Lo Giudice, Paolo; Potena, Domenico; Storti, Emanuele; Ursino, Domenico

doi:10.1007/s10796-020-10010-x

Published: 24 May 2020

Volume 23, pages 243–262 (2021)

Information Systems Frontiers

Claudia Diamantini¹,
Paolo Lo Giudice²,
Domenico Potena¹,
Emanuele Storti¹ &
Domenico Ursino¹

Abstract

In recent years, data lakes have emerged as an effective and efficient support for extracting information and knowledge from a huge number of highly heterogeneous and quickly changing data sources. Data lake management requires new techniques that differ markedly from those adopted for data warehouses in the past. In this scenario, one of the most challenging issues is the extraction of topic-guided (i.e., thematic) views from the (very heterogeneous and often unstructured) sources of a data lake. In this paper, we propose a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we present a new approach to “structuring” unstructured data, at least partially. Finally, we define a technique to extract topic-guided views from the sources of a data lake, based on similarity and other semantic relationships among source metadata.
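To make the idea of a uniform network-based representation more concrete, the following is a minimal sketch, not the authors' actual model. It assumes that each source is a set of named nodes connected by labelled arcs, and it borrows the "contains" arc label, the S.o notation and the name-based node equality mentioned in the article's notes. The class `SourceNetwork`, the function `shared_names` and the objects inside the sources W and E are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SourceNetwork:
    """Toy network for one source: nodes are object names, arcs carry a label."""
    name: str
    nodes: set = field(default_factory=set)
    arcs: set = field(default_factory=set)  # (parent, child, label) triples

    def add_arc(self, parent, child, label="contains"):
        # Adding an arc implicitly registers both endpoints as nodes.
        self.nodes.update((parent, child))
        self.arcs.add((parent, child, label))

    def qualified(self, obj):
        # "S.o" notation: the object o of the source S.
        return f"{self.name}.{obj}"


def shared_names(a, b):
    """Nodes of two sources assumed equal because their names coincide."""
    return a.nodes & b.nodes


# Hypothetical content for two sources named W and E (invented objects).
w = SourceNetwork("W")
w.add_arc("station", "temperature")
w.add_arc("station", "city")

e = SourceNetwork("E")
e.add_arc("plant", "city")
e.add_arc("plant", "emissions")

print(sorted(shared_names(w, e)))                 # names present in both sources
print(w.qualified("city"), e.qualified("city"))   # the same name in S.o notation
```

Under these assumptions, a structured table, a JSON document and a set of extracted keywords could all be loaded into the same kind of object, which is what would allow a single view-extraction procedure to operate over them.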

Notes

http://dbpedia.org/
https://www.zaloni.com/
Recall that, in the database context, a view is the result of a query or of a more complex extraction process, and that users can exploit it for further computations.
http://www.opencalais.com
Here and in the following, to make the presentation smoother, we use the term “source” (resp., “keyword”) to denote both the source (resp., a keyword) and the corresponding node associated with it.
In this paper, we use the term “lemma” according to the meaning it has in BabelNet (Navigli and Ponzetto 2012). Here, given a term, its lemmas are other objects (terms, emoticons, etc.) that contribute to specifying its meaning.
Note that Phases 2 and 4 could be merged into a single phase, which would avoid defining arcs with the label “lemmaOf”. Here, we maintain these arcs and both phases to keep the information about similarity between nodes for future use.
Whenever this does not happen, the mapping can be automatically provided by the DBpedia Lookup Service (http://wiki.dbpedia.org/projects/dbpedia-lookup).
Here, two nodes are assumed to be equal if the corresponding names coincide.
In Figs. 3 and 4, we do not show the arc labels for the sources C, W and E because all of them are “contains” and their presence would have complicated the layout unnecessarily.
Hereafter, we use the notation S.o to indicate the object o of the source S.
In this figure, for layout reasons, we do not show the arc labels because they are the same as the ones of the corresponding arcs of Figs. 3, 4 and 5.
The prefixes dbo and dbr stand for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/, respectively.
Consider that, since we have 20 real sources in the data lakes adopted in our experimental campaign, the value of H_j can range in the real interval [0.05, 20].
As a matter of fact, a topic set with 8 keywords would encompass a great number of different concepts and, as such, would generally not be able to capture a clear and specific desire of a user.
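The dbo and dbr prefixes mentioned in the notes above can be expanded mechanically. The sketch below only illustrates that convention: the two namespace URLs are the ones given in the notes, while the function name `expand` and the sample identifiers are hypothetical.

```python
# Namespace URLs as stated in the notes; everything else is illustrative.
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}


def expand(curie):
    """Expand a prefixed name such as 'dbr:Data_lake' into a full URI.

    Strings whose prefix is not known (including full URLs) are
    returned unchanged.
    """
    prefix, sep, local = curie.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return curie


print(expand("dbr:Data_lake"))
print(expand("dbo:abstract"))
print(expand("http://dbpedia.org/"))  # unknown prefix "http": left as-is
```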

References

Abiteboul, S., & Duschka, O. (1998). Complexity of answering queries using materialized views. In Proc. of the International Symposium on Principles of Database Systems (SIGMOD/PODS’98) (pp. 254–263). Seattle: ACM.
Aversano, L., Intonti, R., Quattrocchi, C., & Tortorella, M. (2010). Building a virtual view of heterogeneous data source views. In Proc. of the International Conference on Software and Data Technologies (ICSOFT’10) (pp. 266–275). Athens: INSTICC Press.
Bachtarzi, C., & Bachtarzi, F. (2015). A model-driven approach for materialized views definition over heterogeneous databases. In Proc. of the International Conference on New Technologies of Information and Communication (NTIC’15) (pp. 1–5). Mila: IEEE.
Bergamaschi, S., Castano, S., Vincini, M., & Beneventano, D. (2001). Semantic integration and query of heterogeneous information sources. Data & Knowledge Engineering, 36(3), 215–249.
Bidoit, N., Colazzo, D., Malla, N., & Sartiani, C. (2018). Evaluating queries and updates on big xml documents. Information Systems Frontiers, 20(1), 63–90.
Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2016). Towards intelligent data analysis: the metadata challenge. In Proc. of the International Conference on Internet of Things and Big Data (ioTBD’16) (pp. 331–338). Rome, Italy.
Biskup, J., & Embley, D. (2003). Extracting information from heterogeneous information sources using ontologically specified target views. Information Systems, 28(3), 169–212. Elsevier.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Microtone Publishing.
Bouadjenek, M.R., Hacid, H., & Bouzeghoub, M. (2016). Social networks and information retrieval, how are they converging? A survey, a taxonomy and an analysis of social information retrieval approaches and platforms. Information Systems, 56, 1–18.
Bougouin, A., Boudin, F., & Daille, B. (2013). Topicrank: Graph-based topic ranking for keyphrase extraction. In Proc. of the International Joint Conference on Natural Language Processing (IJCNLP’13) (pp. 543–551). Nagoya: Asian Federation of Natural Language Processing.
Brackenbury, W., Liu, R., Mondal, M., Elmore, A., Ur, B., Chard, K., & Franklin, M. (2018). Draining the data swamp: A similarity-based approach. In Proc. of the International Workshop on Human-in-the-loop Data Analytics (HILDA’18) (p. 13). Houston: ACM.
Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., & Jatowt, A. (2020). YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509, 257–289. Elsevier.
Castano, S., & Antonellis, V.D. (1999). Building views over semistructured data sources. In Proc. of the International Conference on Conceptual Modeling (ER’99) (pp. 146–160). Paris: Springer.
Chen, C., Shyu, M.-L., & Chen, S.-C. (2016). Weighted subspace modeling for semantic concept retrieval using gaussian mixture models. Information Systems Frontiers, 18(5), 877–889.
Corbellini, A., Mateos, C., Zunino, A., Godoy, D., & Schiaffino, S. (2017). Persisting big-data: The NoSQL landscape. Information Systems, 63, 1–23. Elsevier.
De Meo, P., Quattrone, G., Terracina, G., & Ursino, D. (2006). Integration of XML Schemas at various “severity” levels. Information Systems, 31(6), 397–434.
Debattista, J., Lange, C., & Auer, S. (2014). Representing dataset quality metadata using multi-dimensional views. In Proc. of the International Conference on Semantic Systems (SEM’14) (pp. 92–99). Leipzig: ACM.
Dessi, A., & Atzori, M. (2016). A machine-learning approach to ranking rdf properties. Future Generation Computer Systems, 54, 366–377.
Dublin Core Metadata Initiative. (2012). DCMI Metadata Terms. Technical report.
Fan, W., Wang, X., & Wu, Y. (2016). Answering pattern queries using views. IEEE Transactions on Knowledge and Data Engineering, 28(2), 326–341. IEEE.
Fang, H. (2015). Managing data lakes in big data era: What’s a data lake and why has it became popular in data management ecosystem. In Proc. of the International Conference on Cyber Technology in Automation (CYBER’15) (pp. 820–824). Shenyang: IEEE.
Farid, M., Roatis, A., Ilyas, I., Hoffmann, H., & Chu, X. (2016). CLAMS: bringing quality to data lakes. In Proc. of the International Conference on Management of Data (SIGMOD/PODS’16) (pp. 2089–2092). San Francisco: ACM.
García-Moya, L., Kudama, S., Aramburu, M., & Berlanga, R. (2013). Storing and analysing voice of the market data in the corporate data warehouse. Information Systems Frontiers, 15(3), 331–349.
Hai, R., Geisler, S., & Quix, C. (2016). Constance: an intelligent data lake system. In Proc. of the International Conference on Management of Data (SIGMOD 2016) (pp. 2097–2100). San Francisco: ACM.
Hai, R., Quix, C., & Zhou, C. (2018). Query rewriting for heterogeneous data lakes. In Proc. of the International Conference on European Conference on Advances in Databases and Information Systems (ADBIS’18) (pp. 35–49). Budapest: Springer.
Halevy, A. (2001). Answering queries using views: A survey. The VLDB Journal, 10(4), 270–294. Springer.
Hamadou, H., & Ghozzi, F. (2018). Querying heterogeneous document stores. In Proc. of the International Conference on Enterprise Information Systems (ICEIS’18) (pp. 58–68). Madeira, Portugal.
Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1(1), 1–136.
Hirschman, A. (1964). The paternity of an index. The American Economic Review, 54(5), 761–762.
Hitzler, P., & Janowicz, K. (2013). Linked data, big data, and the 4th paradigm. Semantic Web, 4(3), 233–235.
Janjua, N., Hussain, F., & Hussain, O. (2013). Semantic information and knowledge integration through argumentative reasoning to support intelligent decision making. Information Systems Frontiers, 15(2), 167–192.
Keith, A., Cyganiak, R., Hausenblas, M., & Zhao, J. (2011). Describing linked datasets with the void vocabulary. Technical report.
Klettke, M., Awolin, H., Storl, U., Muller, D., & Scherzinger, S. (2017). Uncovering the evolution history of data lakes. In Proc. of the International Conference on Big data (IEEE bigdata 2017) (pp. 2462–2471). Boston: IEEE.
Kondrak, G. (2005). N-gram similarity and distance. In String Processing and Information Retrieval (pp. 115–126). Springer.
Konstantinou, N., Koehler, M., Abel, E., Civili, C., Neumayr, B., Sallinger, E., Fernandes, A., Gottlob, G., Keane, J., & Libkin, L. (2017). The VADA architecture for cost-effective data wrangling. In Proc. of the International Conference on Management of Data (SIGMOD’17) (pp. 1599–1602). Chicago: ACM.
Lassila, O., Swick, R.R., et al. (1998). Resource description framework (RDF) model and syntax specification.
Maccioni, A., & Torlone, R. (2018). KAYAK: a framework for just-in-time data preparation in a data lake. In Proc. of the international Conference on Advanced information Systems Engineering (CAiSE’18) (pp. 474–489). Tallinn: Springer.
Madhavan, J., Bernstein, P., & Rahm, E. (2001). Generic schema matching with Cupid. In Proc. of the International Conference on Very Large Data Bases (VLDB 2001) (pp. 49–58). Rome: Morgan Kaufmann.
McPherson, M., Smith-Lovin, L., & Cook, J. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27, 415–444. JSTOR.
Mouttham, A., Kuziemsky, C., Langayan, D., Peyton, L., & Pereira, J. (2012). Interoperable support for collaborative, mobile, and accessible health care. Information Systems Frontiers, 14(1), 73–85.
Mouzakitis, S., Papaspyros, D., Petychakis, M., Koussouris, S., Zafeiropoulos, A., Fotopoulou, E., Farid, L., Orlandi, F., Attard, J., & Psarras, J. (2017). Challenges and opportunities in renovating public sector information by enabling linked data and analytics. Information Systems Frontiers, 19(2), 321–336.
Tsvetovat, M., & Kouznetsov, A. (2011). Social Network Analysis for startups: Finding connections on the social web. O’Reilly Media Inc.
Navigli, R., & Ponzetto, S. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250. Elsevier.
Oram, A. (2015). Managing the Data Lake. Sebastopol: O’Reilly.
Palopoli, L., Pontieri, L., Terracina, G., & Ursino, D. (2000). Intensional and extensional integration and abstraction of heterogeneous databases. Data & Knowledge Engineering, 35(3), 201–237.
Palopoli, L., Saccà, D., Terracina, G., & Ursino, D. (2003a). Uniform techniques for deriving similarities of objects and subschemes in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 15(2), 271–294.
Palopoli, L., Terracina, G., & Ursino, D. (2001). A graph-based approach for extracting terminological properties of elements of XML documents. In Proc. of the International Conference on Data Engineering (ICDE 2001) (pp. 330–337). Heidelberg: IEEE Computer Society.
Palopoli, L., Terracina, G., & Ursino, D. (2003b). DIKE: A system supporting the semi-automatic construction of Cooperative Information Systems from heterogeneous databases. Software Practice & Experience, 33(9), 847–884.
Palopoli, L., Terracina, G., & Ursino, D. (2003c). Experiences using DIKE, a system for supporting cooperative information system and data warehouse design. Information Systems, 28(7), 835–865.
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1, 1–20. Wiley, New York.
Singh, K., & Singh, V. (2016). Answering graph pattern query using incremental views. In Proc. of the International Conference on Computing (ICCCA’16) (pp. 54–59). Greater Noida: IEEE.
Spink, A., Wolfram, D., Jansen, M.B.J., & Saracevic, T. (2001). Searching the web: the public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226–234.
Wang, J., Li, J., & Yu, J. (2011). Answering tree pattern queries using views: a revisit. In Proc. of the International Conference on Extending Database Technology (EDBT/ICDT’11) (pp. 153–164). Uppsala: ACM.
Wang, J., & Yu, J. (2012). Revisiting answering tree pattern queries using views. ACM Transactions on Database Systems, 37(3), 18. ACM.
Wu, X., Theodoratos, D., & Wang, W. (2009). Answering XML queries using materialized views revisited. In Proc. of the International Conference on Information and Knowledge Management (CIKM ’09) (pp. 475–484). Hong Kong: ACM.
Yi, J., Maghoul, F., & Pedersen, J. (2008). Deciphering mobile search patterns: a study of yahoo! mobile search queries. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08 (pp. 257–266). New York: ACM.

Author information

Authors and Affiliations

DII, Polytechnic University of Marche, Ancona, Italy
Claudia Diamantini, Domenico Potena, Emanuele Storti & Domenico Ursino
DIIES, University “Mediterranea” of Reggio Calabria, Reggio Calabria, Italy
Paolo Lo Giudice

Corresponding author

Correspondence to Claudia Diamantini.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Diamantini, C., Lo Giudice, P., Potena, D. et al. An Approach to Extracting Topic-guided Views from the Sources of a Data Lake. Inf Syst Front 23, 243–262 (2021). https://doi.org/10.1007/s10796-020-10010-x

Published: 24 May 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s10796-020-10010-x
