Information Sciences, Volume 478, April 2019, Pages 606-626

An approach to extracting complex knowledge patterns among concepts belonging to structured, semi-structured and unstructured sources in a data lake

https://doi.org/10.1016/j.ins.2018.11.052

Abstract

In this paper, we propose a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake, which is one of the newest and most successful architectures proposed for managing big data. Then, we present a new approach to, at least partially, “structuring” unstructured sources. Finally, with the support of these two tools, we define a new approach to extracting complex knowledge patterns from the data stored in a data lake.

Introduction

In the last few years, the “big data phenomenon” has been rapidly changing the research and technological “coordinates” of the information system area [2], [46]. For instance, it is well known that data warehouses, which generally handle structured and semi-structured data offline, are too complex and rigid to manage the large volume and variety of rapidly evolving data sources of interest for a given organization, so the use of more agile and flexible structures appears compulsory [9]. Data lakes are one of the most promising answers to this need. Unlike a data warehouse, a data lake uses a flat architecture, so that the insertion and removal of a source can be performed easily. The agile and effective management of the data stored therein is nevertheless guaranteed by the presence of a rich set of extended metadata, which allows a very agile and easily configurable usage of the data in the lake. For instance, if a given application requires querying some data sources, one could process the available metadata to determine the portion of the data lake to examine.

One of the most radical changes caused by the big data phenomenon is the presence of a huge amount of unstructured data. As a matter of fact, it is estimated that, currently, more than 80% of the information available on the Internet is unstructured [6]. In the presence of unstructured data, all the approaches developed in the past for structured and semi-structured data must be “renewed”, and the new approaches will presumably be much more complex than the old ones [20], [42]. Think, for instance, of schema integration: unstructured sources do not have a representing schema and, often, only a set of keywords is given (or can be extracted) to represent the corresponding content [10].

This paper aims at providing a contribution in this setting. In particular, it proposes an approach to the extraction of complex knowledge patterns among concepts belonging to structured, semi-structured and unstructured sources in a data lake. Here, we use the term “complex knowledge pattern” to indicate an intensional relationship involving multiple concepts possibly belonging to different (and, presumably, heterogeneous) sources of a data lake. Formally speaking, in this paper, a complex knowledge pattern consists of a logical succession {x1, x2, …, xw} of w objects such that there is a semantic relationship (specifically, a synonymy or a part-of relationship) linking the kth and the (k+1)th objects (1 ≤ k ≤ w − 1) of the succession.

Our approach is network-based in that it represents all the data lake sources by means of suitable networks. As a matter of fact, networks are very flexible structures, which allow the modeling of almost all phenomena that researchers aim at investigating. For instance, they have been used in the past to uniformly represent data sources characterized by heterogeneous, both structured and semi-structured, formats [8]. In this paper, we also use networks to represent unstructured sources, which, as said before, do not have a representing schema. Furthermore, we propose a technique to construct a “structured representation” of the flat keywords generally used to represent unstructured data sources. This is a fundamental task because it greatly facilitates the uniform management, through our network-based model, of structured, semi-structured and unstructured data sources.
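To make the idea of a uniform network representation concrete, the following minimal sketch (illustrative only; the function names and the arc-set encoding are our assumptions, not the paper's formalism) turns a structured source and a keyword-described unstructured source into the same form:

```python
# Illustrative only: nodes are object names, arcs are pairs of related
# objects; the paper's actual model is richer than this sketch.

def structured_to_network(tables):
    """A structured source: each table node is linked to its attributes."""
    return {(table, attr) for table, attrs in tables.items() for attr in attrs}

def unstructured_to_network(source_name, keywords):
    """An unstructured source described by flat keywords: each keyword is
    linked to a node standing for the source itself. (The paper goes
    further, partially 'structuring' the keyword set.)"""
    return {(source_name, kw) for kw in keywords}

# Both sources now share one representation and can be merged.
network = (structured_to_network({"Plant": ["name", "emissions"]})
           | unstructured_to_network("PollutionReport", ["smog", "emissions"]))
```

Once every source is reduced to such a network, graph algorithms can operate on all of them uniformly.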

Thanks to this uniform, network-based representation of the data lake sources, the extraction of complex knowledge patterns can be performed by exploiting graph-based tools. In particular, given our definition of complex knowledge patterns, a possible approach to their derivation consists of constructing a suitable path going from the first node (i.e., x1) to the last node (i.e., xw) of the succession expressing the pattern. In this case, a user specifies the “seed objects” of the pattern (i.e., x1 and xw) and our approach finds a suitable path (if it exists) linking x1 to xw.

Since x1 and xw could belong to different sources, our approach must consider the possible presence of synonymies between concepts belonging to different sources. It models these synonymies by means of a suitable form of arcs (cross arcs, or c-arcs), and allows both intra-source arcs (inner arcs, or i-arcs) and c-arcs in the path connecting x1 to xw that represents the complex knowledge pattern of interest.

Among all the possible paths connecting x1 to xw, our approach takes the shortest one, i.e., the one with the minimum number of c-arcs and, among paths with the same number of c-arcs, the one with the minimum number of i-arcs. This choice is motivated by the observation that a knowledge pattern should be as semantically homogeneous as possible. With this in mind, it is appropriate to reduce as much as possible the number of synonymies considered in the knowledge pattern from x1 to xw. This is because a synonymy is weaker than an identity and, furthermore, it involves objects belonging to different sources, which inevitably causes an “impairment” in the path going from x1 to xw. Moreover, there is a further, more technological reason for choosing the shortest path. Indeed, it is presumable that, after a complex knowledge pattern has been defined and validated at the intensional level, one would like to recover the corresponding data at the extensional level. In this case, in a big data scenario, reducing the number of sources to query would generally reduce the volume and variety of the data to process and, hence, would increase the velocity at which the data of interest are retrieved and processed.
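The preference just described (fewest c-arcs first, fewest i-arcs as a tie-breaker) can be encoded as a lexicographic cost in a standard shortest-path search. The sketch below is an illustrative reimplementation under that reading, not the authors' code:

```python
import heapq

# Cost tuples (c_count, i_count) compare lexicographically, so paths
# with fewer c-arcs always win, with i-arcs breaking ties.

def best_pattern_path(arcs, start, goal):
    """arcs: node -> list of (neighbor, kind) with kind 'i' or 'c'."""
    heap = [((0, 0), start, [start])]
    settled = {}
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue
        settled[node] = cost
        for nbr, kind in arcs.get(node, []):
            c, i = cost
            step = (c + 1, i) if kind == "c" else (c, i + 1)
            heapq.heappush(heap, (step, nbr, path + [nbr]))
    return None  # no pattern connects start to goal

# Two candidate paths: x1-a-b-xw uses 1 c-arc and 2 i-arcs,
# x1-b-xw uses 1 c-arc and 1 i-arc, so the latter is preferred.
arcs = {"x1": [("a", "i"), ("b", "c")],
        "a": [("b", "c")],
        "b": [("xw", "i")]}
print(best_pattern_path(arcs, "x1", "xw"))  # ((1, 1), ['x1', 'b', 'xw'])
```

Tuple comparison in Python's heap gives the lexicographic ordering for free, so no custom weighting scheme is needed.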

As will become clear in the following, there are cases in which synonymies (and, hence, c-arcs) are not sufficient to find a complex knowledge pattern from x1 to xw. In these cases, our approach performs two further attempts, in which it tries to involve string similarities first and, if even these properties are not sufficient, part-whole relationships. If neither synonymies nor string similarities nor part-whole relationships allow the construction of a path from x1 to xw, our approach concludes that, in the data lake under consideration, a complex knowledge pattern from x1 to xw does not exist.
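The staged strategy above can be sketched as progressively admitting more arc types until a path appears. This is an illustrative rendering (all names are ours; directed adjacency is assumed for brevity), not the paper's algorithm:

```python
from collections import deque

def find_path(arcs, start, goal):
    """Plain BFS over an adjacency dict node -> set of neighbors."""
    queue, prev = deque([start]), {start: None}
    while queue:
        n = queue.popleft()
        if n == goal:
            path = []
            while n is not None:
                path.append(n)
                n = prev[n]
            return path[::-1]
        for m in arcs.get(n, ()):
            if m not in prev:
                prev[m] = n
                queue.append(m)
    return None

def staged_pattern(i_arcs, syn_arcs, sim_arcs, partof_arcs, x1, xw):
    """Admit synonymy arcs first, then string-similarity arcs, then
    part-of arcs; give up only when all three stages fail."""
    arcs = {k: set(v) for k, v in i_arcs.items()}
    for stage in (syn_arcs, sim_arcs, partof_arcs):
        for k, v in stage.items():
            arcs.setdefault(k, set()).update(v)
        path = find_path(arcs, x1, xw)
        if path:
            return path
    return None  # no complex knowledge pattern in this data lake
```

Each stage only adds arcs, so an earlier failure can never hide a path that a later, looser stage would find.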

Summarizing, the main contributions of this paper are the following:

  • It proposes a new network-based model to represent the structured, semi-structured and unstructured sources of a data lake.

  • It proposes a new approach to, at least partially, “structuring” unstructured sources.

  • It proposes a new approach to extracting complex knowledge patterns from the sources of a data lake.

This paper is structured as follows: in Section 2, we illustrate related literature. In Section 3, we present our network-based model for data lakes. In Section 4, we describe our approach to enriching the representation of unstructured data sources in such a way as to, at least partially, “structure” them. In Section 5, we present our approach to the extraction of complex knowledge patterns. In Section 6, we describe some case studies conceived to illustrate the various possible behaviors of our approach. In Section 7, we present a critical discussion of several aspects concerning our approach. Finally, in Section 8, we draw our conclusions.

Section snippets

Related literature

In the literature there is a strong agreement on the definition of data lake. For instance, Hai et al. [15] define data lakes as “big data repositories which store raw data and provide functionality for on-demand integration with the help of metadata descriptions”. Terrizzano et al. [44] claim that “a data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available

A network-based model for data lakes

In this section, we illustrate our network-based model to represent and handle a data lake, which we will use in the rest of this paper.

In our model, a data lake DL is represented as a set of m data sources:

DL = {D1, D2, …, Dm}

A data source Dk ∈ DL is provided with a rich set Mk of metadata. We denote with MDL the repository of the metadata of all the data sources of DL:

MDL = {M1, M2, …, Mm}

According to Oram [33], our model represents Mk by means of a triplet:

Mk = ⟨MkT, MkO, MkB⟩

Here:

  • MkT denotes technical
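The model above (a data lake as a set of sources D1, …, Dm, each paired with a metadata triplet Mk) could be sketched as follows. Only "technical" metadata is named in this snippet; the other two components are assumed here to be operational and business metadata, following common metadata taxonomies, and all class names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Metadata:
    """Mk = <MkT, MkO, MkB>; the last two field names are assumptions."""
    technical: dict = field(default_factory=dict)    # MkT
    operational: dict = field(default_factory=dict)  # MkO (assumed)
    business: dict = field(default_factory=dict)     # MkB (assumed)

@dataclass
class DataSource:
    """A source Dk of the data lake, with its metadata set Mk."""
    name: str
    metadata: Metadata

@dataclass
class DataLake:
    """DL = {D1, D2, ..., Dm}."""
    sources: list

    @property
    def metadata_repository(self):
        """MDL: the metadata of all the sources of DL."""
        return [s.metadata for s in self.sources]
```

Keeping the metadata repository derivable from the sources mirrors the definition MDL = {M1, …, Mm} above.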

Enriching the representation of unstructured data

Our network-based model for representing and handling a data lake is well suited to representing and managing semi-structured data because it was designed with XML and JSON in mind. Clearly, it is also sufficiently powerful to represent structured data. The greatest difficulty concerns unstructured data, because it is worth avoiding a flat representation consisting of a simple element for each keyword provided to denote the source content. As a matter of fact, this kind of representation

General description of the approach

Our approach to the extraction of complex knowledge patterns operates on a data lake DL whose data sources are represented by means of the formalism described in Section 4.

It receives a dictionary Syn of synonymies involving the objects stored in the sources of DL. This dictionary could be a generic thesaurus, such as BabelNet [32], a domain-specific thesaurus, or a dictionary obtained by taking into account the structure and the semantics of the sources, which the corresponding objects refer
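As an illustration of how such a dictionary could be used, the sketch below (hypothetical names; not the authors' code) derives candidate c-arcs by pairing objects of different sources that Syn declares synonymous:

```python
def build_c_arcs(objects_by_source, syn):
    """objects_by_source: {source: set of object names};
    syn: {name: set of synonyms}, e.g. drawn from BabelNet or a
    domain thesaurus. Returns cross-source synonym pairs. Identical
    names across sources are treated here as trivial synonyms (an
    assumption of this sketch)."""
    c_arcs = set()
    sources = list(objects_by_source.items())
    for idx, (s1, objs1) in enumerate(sources):
        for s2, objs2 in sources[idx + 1:]:
            for o1 in objs1:
                for o2 in objs2:
                    if (o1 == o2
                            or o2 in syn.get(o1, set())
                            or o1 in syn.get(o2, set())):
                        c_arcs.add((s1, o1, s2, o2))
    return c_arcs

objects = {"D1": {"car", "pollution"}, "D2": {"automobile", "smog"}}
syn = {"car": {"automobile"}, "pollution": {"smog"}}
c_arcs = build_c_arcs(objects, syn)
```

Only pairs from different sources become c-arcs; synonymies within a single source are a separate concern.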

Some case studies

In this section, we present some case studies devoted to illustrating the behavior of our approach in the various possible cases. To perform our test cases, we constructed a data lake consisting of 2 structured sources, 4 semi-structured sources (i.e., 2 XML sources and 2 JSON ones) and 4 unstructured sources (i.e., 2 books and 2 videos). All these sources store data about environment and pollution. To describe unstructured sources, we initially considered a set of keywords derived from Google

Discussion

This section is devoted to presenting a critical discussion of several aspects concerning our approach. It consists of four subsections. In the first, we compare our approach with related ones. In the second, we evaluate the performance of our technique for structuring unstructured data. In the third, we evaluate the performance of our overall approach. Finally, in the fourth, we measure its efficiency on large datasets. To carry out the experiments described in this

Conclusion

In this paper, we have proposed a new network-based model to uniformly represent and handle structured, semi-structured and unstructured sources of a data lake. Then, we have presented a new approach to, at least partially, “structuring” unstructured sources. Furthermore, we have defined a new approach to extracting complex knowledge patterns from the sources of a data lake and we have presented some case studies showing the behavior of our approach in all the possible cases. Finally, we have

References (48)

  • C. Chen et al.

    Data-intensive applications, challenges, techniques and technologies: a survey on Big Data

    Inf. Sci.

    (2014)
  • L. Chen et al.

    RAISE: a whole process modeling method for unstructured data management

    Proc. of the International Conference on Multimedia Big Data (BigMM’15)

    (2015)
  • X. Chen et al.

    Neil: extracting visual knowledge from web data

    Proc. of the International Conference on Computer Vision (ICCV’13)

    (2013)
  • Y. Chen et al.

    Keyword-based search and exploration on databases

    Proc. of the International Conference on Data Engineering (ICDE’11)

    (2011)
  • A. Dass et al.

    Relaxation of keyword pattern graphs on RDF Data

    J. Web Eng.

    (2017)
  • P. De Meo et al.

    Integration of XML schemas at various “severity” levels

    Inf. Syst.

    (2006)
  • F. Di Tria et al.

    Cost-benefit analysis of data warehouse design methodologies

    Inf. Syst.

    (2017)
  • A. Farrugia et al.

    Towards social network analytics for understanding and managing enterprise data lakes

    Proc. of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM’16)

    (2016)
  • T. Foley et al.

    Visualizing and modeling unstructured data

    Vis. Comput.

    (1993)
  • K. Golenberg et al.

    Keyword proximity search in complex data graphs

Proc. of the International Conference on Management of Data (SIGMOD/PODS’08)

    (2008)
  • R. Hai et al.

    Constance: an intelligent data lake system

    Proc. of the International Conference on Management of Data (SIGMOD’16)

    (2016)
  • S. Han et al.

    Keyword search on RDF graphs-a query graph assembly approach

    Proc. of the International Conference on Information and Knowledge Management (CIKM’17)

    (2017)
  • H. He et al.

    BLINKS: ranked keyword searches on graphs

    Proc. of the International Conference on Management of Data (SIGMOD/PODS’07)

    (2007)
  • F. Jebbor et al.

    Overview of knowledge extraction techniques in five question-answering systems

    Proc. of the International Conference on Intelligent Systems: Theories and Applications (SITA’14)

    (2014)