An approach to extracting complex knowledge patterns among concepts belonging to structured, semi-structured and unstructured sources in a data lake
Introduction
In the last few years, the “big data phenomenon” has been rapidly changing the research and technological “coordinates” of the information system area [2], [46]. For instance, it is well known that data warehouses, which generally handle structured and semi-structured data offline, are too complex and rigid to manage the large amount and variety of rapidly evolving data sources of interest for a given organization, so the adoption of more agile and flexible structures appears necessary [9]. Data lakes are one of the most promising answers to this exigency. Differently from a data warehouse, a data lake uses a flat architecture, so that the insertion and removal of a source can be easily performed. At the same time, the agile and effective management of the data stored therein is guaranteed by the presence of a rich set of extended metadata, which allow a very agile and easily configurable usage of the data stored in the data lake. For instance, if a given application requires the querying of some data sources, one could process the available metadata to determine the portion of the data lake to examine.
One of the most radical changes caused by the big data phenomenon is the presence of a huge amount of unstructured data. As a matter of fact, it is estimated that, currently, more than 80% of the information available on the Internet is unstructured [6]. In the presence of unstructured data, all the approaches developed in the past for structured and semi-structured data must be “renewed”, and the new approaches will presumably be much more complex than the old ones [20], [42]. Think, for instance, of schema integration: unstructured sources do not have a representing schema and, often, only a set of keywords is given (or can be extracted) to represent the corresponding content [10].
This paper aims at providing a contribution in this setting. In particular, it proposes an approach to the extraction of complex knowledge patterns among concepts belonging to structured, semi-structured and unstructured sources in a data lake. Here, we use the term “complex knowledge pattern” to indicate an intensional relationship involving multiple concepts, possibly belonging to different (and, presumably, heterogeneous) sources of a data lake. Formally speaking, in this paper, a complex knowledge pattern consists of a logical succession of w objects x1, x2, …, xw such that there is a semantic relationship (specifically, a synonymy or a part-of relationship) linking the kth and the (k+1)th objects of the succession, for each 1 ≤ k ≤ w − 1.
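Under this definition, a candidate succession can be checked mechanically. The following is a minimal sketch, not the paper's actual machinery: the function name and the relationship sets are illustrative placeholders standing in for the synonymy and part-of dictionaries discussed later.

```python
def is_complex_pattern(succession, synonymies, part_of):
    """Return True if every pair of consecutive objects in the succession
    is linked by a synonymy or a part-of relationship (in either order)."""
    if len(succession) < 2:
        return False
    for k in range(len(succession) - 1):
        pair = (succession[k], succession[k + 1])
        rev = (pair[1], pair[0])
        if pair not in synonymies and rev not in synonymies \
                and pair not in part_of and rev not in part_of:
            return False
    return True

# Toy example with invented objects and relationships.
synonymies = {("pollution", "contamination")}
part_of = {("contamination", "environment_report")}
print(is_complex_pattern(
    ["pollution", "contamination", "environment_report"],
    synonymies, part_of))
```

The check is intentionally symmetric: both synonymies and part-of links are treated as undirected when validating consecutive objects.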
Our approach is network-based, in that it represents all the data lake sources by means of suitable networks. As a matter of fact, networks are very flexible structures, which allow the modeling of almost all the phenomena that researchers aim at investigating. For instance, they have been used in the past to uniformly represent data sources characterized by heterogeneous, both structured and semi-structured, formats [8]. In this paper, we also use networks to represent unstructured sources, which, as noted above, do not have a representing schema. Furthermore, we propose a technique to construct a “structured representation” of the flat keywords generally used to represent unstructured data sources. This is a fundamental task because it greatly facilitates the uniform management, through our network-based model, of structured, semi-structured and unstructured data sources.
Thanks to this uniform, network-based representation of the data lake sources, the extraction of complex knowledge patterns can be performed by exploiting graph-based tools. In particular, given our definition of complex knowledge patterns, a possible approach for their derivation consists of constructing suitable paths going from the first node (i.e., x1) to the last node (i.e., xw) of the succession expressing the pattern. In this case, a user specifies the “seed objects” of the pattern (i.e., x1 and xw) and our approach finds a suitable path (if one exists) linking x1 to xw.
Since x1 and xw could belong to different sources, our approach must consider the possible presence of synonymies between concepts belonging to different sources. It models these synonymies by means of a suitable form of arcs (cross arcs, or c-arcs), and it includes both intra-source arcs (inner arcs, or i-arcs) and c-arcs in the path that connects x1 to xw and represents the complex knowledge pattern of interest.
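As an illustration of this two-arc-family model (all source and node names are invented for the example, and the representation is a sketch rather than the paper's actual data structure), intra-source i-arcs and synonymy-based c-arcs can be kept in a single labeled adjacency map:

```python
from collections import defaultdict

def build_lake_graph(sources, synonymies):
    """sources: {source_name: [(node_a, node_b), ...]} intra-source arcs.
    synonymies: pairs of (source, node) tuples linking different sources.
    Returns an undirected adjacency map with edges labeled 'i' or 'c'."""
    graph = defaultdict(list)
    for name, arcs in sources.items():
        for a, b in arcs:
            # i-arcs stay within one source
            graph[(name, a)].append(((name, b), "i"))
            graph[(name, b)].append(((name, a), "i"))
    for u, v in synonymies:
        # c-arcs bridge synonymous objects of different sources
        graph[u].append((v, "c"))
        graph[v].append((u, "c"))
    return graph

# Two toy sources bridged by one synonymy.
g = build_lake_graph({"S1": [("a", "b")], "S2": [("c", "d")]},
                     [(("S1", "b"), ("S2", "c"))])
```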
Among all the possible paths connecting x1 to xw, our approach takes the shortest one (i.e., the one with the minimum number of c-arcs and, among the paths with that number of c-arcs, the one with the minimum number of i-arcs). This choice is motivated by the observation that a knowledge pattern should be as semantically homogeneous as possible. With this in mind, it is appropriate to reduce as much as possible the number of synonymies involved in the knowledge pattern from x1 to xw. This is because a synonymy is weaker than an identity and, furthermore, it involves objects belonging to different sources, which inevitably causes an “impairment” in the path going from x1 to xw. Moreover, there is a further, more technological reason for choosing the shortest path. Presumably, after a complex knowledge pattern has been defined and validated at the intensional level, one would like to retrieve the corresponding data at the extensional level. In this case, in a big data scenario, reducing the number of sources to query generally reduces the volume and variety of the data to process and, hence, increases the velocity at which the data of interest are retrieved and processed.
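This selection criterion is a lexicographic shortest path: minimize c-arcs first, then i-arcs. A minimal sketch of it, assuming the graph is stored as an adjacency map whose edges are labeled 'i' (intra-source) or 'c' (cross-source), is a Dijkstra-style search over tuple-valued costs; this is an illustrative implementation, not the paper's own algorithm:

```python
import heapq

def shortest_pattern_path(graph, x1, xw):
    """Return a path from x1 to xw minimizing (#c-arcs, #i-arcs)
    lexicographically, or None if no path exists."""
    # Heap entries: (c_count, i_count, node, path); Python compares
    # tuples element-wise, which yields the lexicographic order we need.
    heap = [(0, 0, x1, [x1])]
    best = {}
    while heap:
        c, i, node, path = heapq.heappop(heap)
        if node == xw:
            return path
        if best.get(node, (float("inf"), float("inf"))) <= (c, i):
            continue  # already reached this node at no worse cost
        best[node] = (c, i)
        for neigh, label in graph.get(node, []):
            nc, ni = (c + 1, i) if label == "c" else (c, i + 1)
            heapq.heappush(heap, (nc, ni, neigh, path + [neigh]))
    return None

# Toy graph: the route through 'c' uses one c-arc, the route
# through 'b' uses two, so the former is preferred.
graph = {
    "a": [("b", "c"), ("c", "i")],
    "b": [("d", "c")],
    "c": [("d", "c")],
}
print(shortest_pattern_path(graph, "a", "d"))  # → ['a', 'c', 'd']
```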
As will become clear in the following, there are cases in which synonymies (and, hence, c-arcs) are not sufficient to find a complex knowledge pattern from x1 to xw. In these cases, our approach performs two further attempts, involving string similarities first and, if these are still not sufficient, part-whole relationships. If neither synonymies nor string similarities nor part-whole relationships allow the construction of a path from x1 to xw, our approach concludes that, in the data lake under consideration, a complex knowledge pattern from x1 to xw does not exist.
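The escalation just described can be sketched as a three-stage search that widens the arc set until a path appears. For brevity, a plain BFS stands in for the lexicographic shortest-path criterion, and all names are illustrative placeholders:

```python
from collections import deque

def bfs_path(adj, start, goal):
    """Plain BFS returning some path from start to goal, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def extract_pattern(syn_adj, sim_arcs, part_arcs, x1, xw):
    """Three-stage search: synonymy arcs only, then + string-similarity
    arcs, then + part-whole arcs. Returns (stage, path) or (None, None)."""
    adj = {n: list(vs) for n, vs in syn_adj.items()}
    stages = [("synonymy", []), ("similarity", sim_arcs), ("part-whole", part_arcs)]
    for stage, extra in stages:
        for a, b in extra:  # widen the arc set for this stage
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
        path = bfs_path(adj, x1, xw)
        if path is not None:
            return stage, path
    return None, None
```

In this sketch the stages are cumulative, matching the text: part-whole arcs are only tried on top of the synonymy and similarity arcs already added.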
Summarizing, the main contributions of this paper are the following:
- It proposes a new network-based model to represent the structured, semi-structured and unstructured sources of a data lake.
- It proposes a new approach to, at least partially, “structuring” unstructured sources.
- It proposes a new approach to extracting complex knowledge patterns from the sources of a data lake.
This paper is structured as follows: in Section 2, we illustrate related literature. In Section 3, we present our network-based model for data lakes. In Section 4, we describe our approach to enriching the representation of unstructured data sources in such a way as to, at least partially, “structure” them. In Section 5, we present our approach to the extraction of complex knowledge patterns. In Section 6, we describe some case studies conceived to illustrate the various possible behaviors of our approach. In Section 7, we present a critical discussion of several aspects concerning our approach. Finally, in Section 8, we draw our conclusions.
Related literature
In the literature, there is strong agreement on the definition of a data lake. For instance, Hai et al. [15] define data lakes as “big data repositories which store raw data and provide functionality for on-demand integration with the help of metadata descriptions”. Terrizzano et al. [44] claim that “a data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available
A network-based model for data lakes
In this section, we illustrate our network-based model to represent and handle a data lake, which we will use in the rest of this paper.
In our model, a data lake DL is represented as a set of m data sources:

DL = {D1, D2, …, Dm}
A data source Dk ∈ DL is provided with a rich set of metadata. We denote with the repository of the metadata of all the data sources of DL:
According to Oram [33], our model represents by means of a triplet:
Here:
- denotes technical
Enriching the representation of unstructured data
Our network-based model for representing and handling a data lake is well suited to representing and managing semi-structured data because it has been designed with XML and JSON in mind. Clearly, it is sufficiently powerful to represent structured data as well. The main difficulty concerns unstructured data, because it is worth avoiding a flat representation consisting of a simple element for each keyword provided to denote the source content. As a matter of fact, this kind of representation
General description of the approach
Our approach to the extraction of complex knowledge patterns operates on a data lake DL whose data sources are represented by means of the formalism described in Section 4.
It receives a dictionary Syn of synonymies involving the objects stored in the sources of DL. This dictionary could be a generic thesaurus, such as BabelNet [32], a domain-specific thesaurus, or a dictionary obtained by taking into account the structure and the semantics of the sources, which the corresponding objects refer
Some case studies
In this section, we present some case studies devoted to illustrating the behavior of our approach in the various possible cases. To perform our tests, we constructed a data lake consisting of 2 structured sources, 4 semi-structured sources (i.e., 2 XML sources and 2 JSON ones) and 4 unstructured sources (i.e., 2 books and 2 videos). All these sources store data about environment and pollution. To describe unstructured sources, we initially considered a set of keywords derived from Google
Discussion
This section presents a critical discussion of several aspects of our approach. It consists of four subsections. In the first, we compare our approach with related ones. In the second, we evaluate the performance of our technique for structuring unstructured data. In the third, we evaluate the performance of our overall approach. Finally, in the fourth, we measure its efficiency on large datasets. To carry out the experiments described in this
Conclusion
In this paper, we have proposed a new network-based model to uniformly represent and handle structured, semi-structured and unstructured sources of a data lake. Then, we have presented a new approach to, at least partially, “structuring” unstructured sources. Furthermore, we have defined a new approach to extracting complex knowledge patterns from the sources of a data lake and we have presented some case studies showing the behavior of our approach in all the possible cases. Finally, we have
References (48)
- et al., Semantic integration and query of heterogeneous information sources, Data Knowl. Eng. (2001)
- et al., Persisting big-data: the NoSQL landscape, Inf. Syst. (2017)
- et al., An approach to extracting thematic views from highly heterogeneous sources of a data lake, Atti del Ventiseiesimo Convegno Nazionale su Sistemi Evoluti per Basi di Dati (SEBD’18) (2018)
- et al., CLAMS: bringing quality to Data Lakes, Proc. of the International Conference on Management of Data (SIGMOD/PODS’16) (2016)
- et al., A new Social Network Analysis-based approach to extracting knowledge patterns about research activities and hubs in a set of countries, Int. J. Bus. Innov. Res. (2017)
- et al., Big data, fast data and data lake concepts, Procedia Comput. Sci. (2016)
- et al., Representing natural language sentences in RDF graphs to derive knowledge patterns, Proc. of the International Conference on Data Engineering and Communication Technology (ICDECT’17) (2017)
- et al., Visual Bayesian fusion to navigate a data lake, Proc. of the International Conference on Information Fusion (FUSION’16) (2016)
- et al., Data wrangling: the challenging journey from the wild to the lake, Proc. of the International Conference on Innovative Data Systems Research (CIDR’15) (2015)
- et al., Efficient pattern matching on big uncertain graphs, Inf. Sci. (2016)
- Data-intensive applications, challenges, techniques and technologies: a survey on Big Data, Inf. Sci.
- RAISE: a whole process modeling method for unstructured data management, Proc. of the International Conference on Multimedia Big Data (BigMM’15)
- NEIL: extracting visual knowledge from web data, Proc. of the International Conference on Computer Vision (ICCV’13)
- Keyword-based search and exploration on databases, Proc. of the International Conference on Data Engineering (ICDE’11)
- Relaxation of keyword pattern graphs on RDF data, J. Web Eng.
- Integration of XML schemas at various “severity” levels, Inf. Syst.
- Cost-benefit analysis of data warehouse design methodologies, Inf. Syst.
- Towards social network analytics for understanding and managing enterprise data lakes, Proc. of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM’16)
- Visualizing and modeling unstructured data, Vis. Comput.
- Keyword proximity search in complex data graphs, Proc. of the International Conference on Management of Data (SIGMOD/PODS’08)
- Constance: an intelligent data lake system, Proc. of the International Conference on Management of Data (SIGMOD’16)
- Keyword search on RDF graphs - a query graph assembly approach, Proc. of the International Conference on Information and Knowledge Management (CIKM’17)
- BLINKS: ranked keyword searches on graphs, Proc. of the International Conference on Management of Data (SIGMOD/PODS’07)
- Overview of knowledge extraction techniques in five question-answering systems, Proc. of the International Conference on Intelligent Systems: Theories and Applications (SITA’14)