Abstract
In recent years, data lakes have emerged as an effective and efficient support for information and knowledge extraction from huge amounts of highly heterogeneous and quickly changing data sources. Data lake management requires the definition of new techniques, very different from the ones adopted for data warehouses in the past. In this scenario, one of the most challenging issues is the extraction of topic-guided (i.e., thematic) views from the (very heterogeneous and often unstructured) sources of a data lake. In this paper, we propose a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we present a new approach to “structuring”, at least partially, unstructured data. Finally, we define a technique to extract topic-guided views from the sources of a data lake, based on similarity and other semantic relationships among source metadata.
Notes
Recall that, in the database context, a view is the result of a query or of a more complex extraction process that users can exploit for further computations.
Here and in the following, to make the presentation smoother, we use the term “source” (resp., “keyword”) to denote both the source (resp., the keyword) and the node associated with it.
In this paper, we use the term “lemma” according to the meaning it has in BabelNet (Navigli and Ponzetto 2012). Here, given a term, its lemmas are other objects (terms, emoticons, etc.) that contribute to specifying its meaning.
Note that Phases 2 and 4 could be merged into a single one, avoiding the need to define arcs with label “lemmaOf”. Here, we maintain these arcs and both phases to keep the information about similarity between nodes for future use.
Whenever this does not happen, the mapping can be automatically provided by the DBpedia Lookup Service (http://wiki.dbpedia.org/projects/dbpedia-lookup).
Here, two nodes are assumed to be equal if the corresponding names coincide.
Hereafter, we use the notation S.o to indicate the object o of the source S.
Prefixes dbo and dbr stand for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/, respectively.
Consider that, since we have 20 real sources in the data lakes adopted in our experimental campaign, the value of Hj can range in the real interval [0.05, 20].
As a matter of fact, a topic set with 8 keywords would encompass a great number of different concepts and, as such, it would not be generally able to capture a clear and specific desire of a user.
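Two of the conventions stated in the notes above (the dbo and dbr prefixes, and node equality by name) can be illustrated with a minimal sketch. The helper names `expand` and `same_node` are hypothetical and do not come from the paper; exact, case-sensitive string matching is an assumption made here for illustration only.

```python
# Prefix-to-namespace map, taken from the note on dbo and dbr.
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'dbr:Rome' into a full IRI.

    Hypothetical helper: it only knows the two prefixes listed above.
    """
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


def same_node(name_a: str, name_b: str) -> bool:
    """Treat two nodes as equal when their names coincide.

    Assumption: 'coincide' is read as exact string equality.
    """
    return name_a == name_b
```

For instance, `expand("dbr:Rome")` yields `http://dbpedia.org/resource/Rome`, and `same_node("city", "City")` is false under the exact-match assumption.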
References
Abiteboul, S., & Duschka, O. (1998). Complexity of answering queries using materialized views. In Proc. of the International Symposium on Principles of Database Systems (SIGMOD/PODS’98) (pp. 254–263). Seattle: ACM.
Aversano, L., Intonti, R., Quattrocchi, C., & Tortorella, M. (2010). Building a virtual view of heterogeneous data source views. In Proc. of the International Conference on Software and Data Technologies (ICSOFT’10) (pp. 266–275). Athens: INSTICC Press.
Bachtarzi, C., & Bachtarzi, F. (2015). A model-driven approach for materialized views definition over heterogeneous databases. In Proc. of the International Conference on New Technologies of Information and Communication (NTIC’15) (pp. 1–5). Mila: IEEE.
Bergamaschi, S., Castano, S., Vincini, M., & Beneventano, D. (2001). Semantic integration and query of heterogeneous information sources. Data & Knowledge Engineering, 36(3), 215–249.
Bidoit, N., Colazzo, D., Malla, N., & Sartiani, C. (2018). Evaluating queries and updates on big xml documents. Information Systems Frontiers, 20(1), 63–90.
Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2016). Towards intelligent data analysis: the metadata challenge. In Proc. of the International Conference on Internet of Things and Big Data (IoTBD’16) (pp. 331–338). Rome, Italy.
Biskup, J., & Embley, D. (2003). Extracting information from heterogeneous information sources using ontologically specified target views. Information Systems, 28(3), 169–212. Elsevier.
Blei, D., Ng, A., & Jordan, M. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Microtone Publishing.
Bouadjenek, M.R., Hacid, H., & Bouzeghoub, M. (2016). Social networks and information retrieval, how are they converging? A survey, a taxonomy and an analysis of social information retrieval approaches and platforms. Information Systems, 56, 1–18.
Bougouin, A., Boudin, F., & Daille, B. (2013). Topicrank: Graph-based topic ranking for keyphrase extraction. In Proc. of the International Joint Conference on Natural Language Processing (IJCNLP’13) (pp. 543–551). Nagoya: Asian Federation of Natural Language Processing.
Brackenbury, W., Liu, R., Mondal, M., Elmore, A., Ur, B., Chard, K., & Franklin, M. (2018). Draining the data swamp: A similarity-based approach. In Proc. of the International Workshop on Human-in-the-loop Data Analytics (HILDA’18) (p. 13). Houston: ACM.
Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., & Jatowt, A. (2020). YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509, 257–289. Elsevier.
Castano, S., & Antonellis, V.D. (1999). Building views over semistructured data sources. In Proc. of the International Conference on Conceptual Modeling (ER’99) (pp. 146–160). Paris: Springer.
Chen, C., Shyu, M.-L., & Chen, S.-C. (2016). Weighted subspace modeling for semantic concept retrieval using gaussian mixture models. Information Systems Frontiers, 18(5), 877–889.
Corbellini, A., Mateos, C., Zunino, A., Godoy, D., & Schiaffino, S. (2017). Persisting big-data: The NoSQL landscape. Information Systems, 63, 1–23. Elsevier.
De Meo, P., Quattrone, G., Terracina, G., & Ursino, D. (2006). Integration of XML Schemas at various “severity” levels. Information Systems, 31(6), 397–434.
Debattista, J., Lange, C., & Auer, S. (2014). Representing dataset quality metadata using multi-dimensional views. In Proc. of the International Conference on Semantic Systems (SEM’14) (pp. 92–99). Leipzig: ACM.
Dessi, A., & Atzori, M. (2016). A machine-learning approach to ranking rdf properties. Future Generation Computer Systems, 54, 366–377.
Dublin Core Metadata Initiative. (2012). DCMI Metadata Terms. Technical report.
Fan, W., Wang, X., & Wu, Y. (2016). Answering pattern queries using views. IEEE Transactions on Knowledge and Data Engineering, 28(2), 326–341. IEEE.
Fang, H. (2015). Managing data lakes in big data era: What’s a data lake and why has it became popular in data management ecosystem. In Proc. of the International Conference on Cyber Technology in Automation (CYBER’15) (pp. 820–824). Shenyang: IEEE.
Farid, M., Roatis, A., Ilyas, I., Hoffmann, H., & Chu, X. (2016). CLAMS: bringing quality to data lakes. In Proc. of the International Conference on Management of Data (SIGMOD/PODS’16) (pp. 2089–2092). San Francisco: ACM.
García-Moya, L., Kudama, S., Aramburu, M., & Berlanga, R. (2013). Storing and analysing voice of the market data in the corporate data warehouse. Information Systems Frontiers, 15(3), 331–349.
Hai, R., Geisler, S., & Quix, C. (2016). Constance: an intelligent data lake system. In Proc. of the International Conference on Management of Data (SIGMOD 2016) (pp. 2097–2100). San Francisco: ACM.
Hai, R., Quix, C., & Zhou, C. (2018). Query rewriting for heterogeneous data lakes. In Proc. of the European Conference on Advances in Databases and Information Systems (ADBIS’18) (pp. 35–49). Budapest: Springer.
Halevy, A. (2001). Answering queries using views: A survey. The VLDB Journal, 10(4), 270–294. Springer.
Hamadou, H., & Ghozzi, F. (2018). Querying heterogeneous document stores. In Proc. of the International Conference on Enterprise Information Systems (ICEIS’18) (pp. 58–68). Madeira, Portugal.
Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1), 1–136.
Hirschman, A. (1964). The paternity of an index. The American Economic Review, 54(5), 761–762.
Hitzler, P., & Janowicz, K. (2013). Linked data, big data, and the 4th paradigm. Semantic Web, 4(3), 233–235.
Janjua, N., Hussain, F., & Hussain, O. (2013). Semantic information and knowledge integration through argumentative reasoning to support intelligent decision making. Information Systems Frontiers, 15(2), 167–192.
Keith, A., Cyganiak, R., Hausenblas, M., & Zhao, J. (2011). Describing linked datasets with the void vocabulary. Technical report.
Klettke, M., Awolin, H., Storl, U., Muller, D., & Scherzinger, S. (2017). Uncovering the evolution history of data lakes. In Proc. of the International Conference on Big data (IEEE bigdata 2017) (pp. 2462–2471). Boston: IEEE.
Kondrak, G. (2005). N-gram similarity and distance. In String Processing and Information Retrieval (pp. 115–126). Springer.
Konstantinou, N., Koehler, M., Abel, E., Civili, C., Neumayr, B., Sallinger, E., Fernandes, A., Gottlob, G., Keane, J., & Libkin, L. (2017). The VADA architecture for cost-effective data wrangling. In Proc. of the International Conference on Management of Data (SIGMOD’17) (pp. 1599–1602). Chicago: ACM.
Lassila, O., Swick, R.R., et al. (1998). Resource description framework (RDF) model and syntax specification.
Maccioni, A., & Torlone, R. (2018). KAYAK: a framework for just-in-time data preparation in a data lake. In Proc. of the International Conference on Advanced Information Systems Engineering (CAiSE’18) (pp. 474–489). Tallinn: Springer.
Madhavan, J., Bernstein, P., & Rahm, E. (2001). Generic schema matching with Cupid. In Proc. of the International Conference on Very Large Data Bases (VLDB 2001) (pp. 49–58). Rome: Morgan Kaufmann.
McPherson, M., Smith-Lovin, L., & Cook, J. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27, 415–444. JSTOR.
Mouttham, A., Kuziemsky, C., Langayan, D., Peyton, L., & Pereira, J. (2012). Interoperable support for collaborative, mobile, and accessible health care. Information Systems Frontiers, 14(1), 73–85.
Mouzakitis, S., Papaspyros, D., Petychakis, M., Koussouris, S., Zafeiropoulos, A., Fotopoulou, E., Farid, L., Orlandi, F., Attard, J., & Psarras, J. (2017). Challenges and opportunities in renovating public sector information by enabling linked data and analytics. Information Systems Frontiers, 19(2), 321–336.
Tsvetovat, M., & Kouznetsov, A. (2011). Social Network Analysis for startups: Finding connections on the social web. O’Reilly Media Inc.
Navigli, R., & Ponzetto, S. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250. Elsevier.
Oram, A. (2015). Managing the Data Lake. Sebastopol: O’Reilly.
Palopoli, L., Pontieri, L., Terracina, G., & Ursino, D. (2000). Intensional and extensional integration and abstraction of heterogeneous databases. Data & Knowledge Engineering, 35(3), 201–237.
Palopoli, L., Saccà, D., Terracina, G., & Ursino, D. (2003a). Uniform techniques for deriving similarities of objects and subschemes in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 15(2), 271–294.
Palopoli, L., Terracina, G., & Ursino, D. (2001). A graph-based approach for extracting terminological properties of elements of XML documents. In Proc. of the International Conference on Data Engineering (ICDE 2001) (pp. 330–337). Heidelberg: IEEE Computer Society.
Palopoli, L., Terracina, G., & Ursino, D. (2003b). DIKE: A system supporting the semi-automatic construction of Cooperative Information Systems from heterogeneous databases. Software Practice & Experience, 33(9), 847–884.
Palopoli, L., Terracina, G., & Ursino, D. (2003c). Experiences using DIKE, a system for supporting cooperative information system and data warehouse design. Information Systems, 28(7), 835–865.
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1, 1–20. Wiley, New York.
Singh, K., & Singh, V. (2016). Answering graph pattern query using incremental views. In Proc. of the International Conference on Computing (ICCCA’16) (pp. 54–59). Greater Noida: IEEE.
Spink, A., Wolfram, D., Jansen, M.B.J., & Saracevic, T. (2001). Searching the web: the public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226–234.
Wang, J., Li, J., & Yu, J. (2011). Answering tree pattern queries using views: a revisit. In Proc. of the International Conference on Extending Database Technology (EDBT/ICDT’11) (pp. 153–164). Uppsala: ACM.
Wang, J., & Yu, J. (2012). Revisiting answering tree pattern queries using views. ACM Transactions on Database Systems, 37(3), 18. ACM.
Wu, X., Theodoratos, D., & Wang, W. (2009). Answering XML queries using materialized views revisited. In Proc. of the International Conference on Information and Knowledge Management (CIKM ’09) (pp. 475–484). Hong Kong: ACM.
Yi, J., Maghoul, F., & Pedersen, J. (2008). Deciphering mobile search patterns: a study of yahoo! mobile search queries. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08 (pp. 257–266). New York: ACM.
Cite this article
Diamantini, C., Lo Giudice, P., Potena, D. et al. An Approach to Extracting Topic-guided Views from the Sources of a Data Lake. Inf Syst Front 23, 243–262 (2021). https://doi.org/10.1007/s10796-020-10010-x
DOI: https://doi.org/10.1007/s10796-020-10010-x