Mining the Web of Linked Data with RapidMiner

doi:10.1016/j.websem.2015.06.004

Journal of Web Semantics

Volume 35, Part 3, December 2015, Pages 142-151

https://doi.org/10.1016/j.websem.2015.06.004

Abstract

Large amounts of data from different domains are published as Linked Open Data (LOD). While there are quite a few browsers for such data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this system paper, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining and analysis platform RapidMiner, and offers operators for accessing Linked Open Data in RapidMiner, so that it can be used in sophisticated data analysis workflows without the need for expert knowledge in SPARQL or RDF. The extension can autonomously explore the Web of Data by following links, thereby discovering relevant datasets on the fly, and can integrate overlapping data found in different datasets. As an example, we show how statistical data from the World Bank on scientific publications, published as an RDF data cube, can be automatically linked to further datasets and analyzed using additional background knowledge from ten different LOD datasets.

Introduction

The Web of Linked Data is a collection of machine-processable, interlinked datasets from various domains, ranging from general cross-domain knowledge sources to government, library and media data. Today, it comprises roughly a thousand datasets [1], [2]. While many domain-specific applications use Linked Open Data, general-purpose applications rarely go beyond displaying the mere data, and provide little means of deriving additional knowledge from the data.

At the same time, sophisticated data mining platforms exist, which support the user in finding patterns in data, providing meaningful visualizations, etc. What is missing is a bridge between the vast amount of data on the one hand, and intelligent data analysis tools on the other hand. Given a data analysis problem, a data analyst should be able to automatically find suitable data from different relevant data sources, which will then be combined and cleansed, and served to the user for further analysis. This data collection, preparation, and fusion process is an essential part of the data analysis workflow [3]. However, it is also one of the most time-consuming parts, constituting roughly half of the costs in data analytics projects [4]. Furthermore, since the step is time-consuming, a data analyst most often makes a heuristic selection of data sources based on a priori assumptions, and hence is subject to selection bias. Despite these issues, automation at that stage of data processing is still rarely achieved.

In this paper, we discuss how the Web of Linked Data can be mined using the full functionality of the state-of-the-art data mining environment RapidMiner [5]. We introduce an extension to RapidMiner that bridges the gap between the Web of Data and data mining, and that can be used for carrying out sophisticated analysis tasks on and with Linked Open Data. The extension provides means to automatically connect local data to background knowledge from Linked Open Data, or to load data from the desired Linked Open Data source into the RapidMiner platform, which itself provides more than 400 operators for analyzing data, including classification, clustering, and association analysis.

RapidMiner is a programming-free data analysis platform, which allows the user to design data analysis processes in a plug-and-play fashion by wiring operators. Furthermore, functionality can be added to RapidMiner by developing extensions, which are made available on the RapidMiner Marketplace. The RapidMiner Linked Open Data extension adds operators for loading data from datasets within Linked Open Data, as well as for autonomously following RDF links to other datasets and gathering additional data from there. Furthermore, the extension supports schema matching for data gathered from different datasets.
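The idea of following RDF links to gather additional data can be illustrated with a minimal Python sketch. It is not the extension's implementation: the in-memory triples, all URIs and values, the function name `follow_links`, and the `max_hops` parameter are assumptions made for illustration, and `owl:sameAs` is used here as one common kind of RDF link.

```python
from collections import deque

# Toy triples (subject, predicate, object). All URIs and values are
# hypothetical and stand in for data that would come from LOD sources.
TRIPLES = [
    ("ex:Germany", "owl:sameAs", "db:Germany"),
    ("db:Germany", "owl:sameAs", "geo:2921044"),
    ("db:Germany", "db:populationTotal", 80000000),
    ("geo:2921044", "geo:continent", "EU"),
    ("ex:Germany", "ex:label", "Germany"),
]


def follow_links(start, triples, max_hops=2):
    """Collect attributes of `start` and of every resource reachable
    from it via owl:sameAs within `max_hops` hops (breadth-first)."""
    seen = {start}
    queue = deque([(start, 0)])
    features = {}
    while queue:
        resource, hops = queue.popleft()
        for subject, predicate, obj in triples:
            if subject != resource:
                continue
            if predicate == "owl:sameAs":
                # Follow the link only if the hop budget allows it.
                if obj not in seen and hops < max_hops:
                    seen.add(obj)
                    queue.append((obj, hops + 1))
            else:
                # Keep the first value found for each attribute.
                features.setdefault(predicate, obj)
    return features


features = follow_links("ex:Germany", TRIPLES)
one_hop = follow_links("ex:Germany", TRIPLES, max_hops=1)
```

With the default hop budget, the sketch gathers attributes from all three toy resources; with `max_hops=1`, the attribute from the third resource is no longer reached.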

As the operators from that extension can be combined with all RapidMiner built-in operators, as well as those from other extensions (e.g., for time series analysis), complex data analysis processes on Linked Open Data can be built. Such processes can automatically combine and integrate data from different datasets and support the user in making sense of the integrated data.

The use case we pursue in this paper starts from a Linked Open Dataset publishing various World Bank indicators. Among many others, this dataset captures the number of scientific journal publications in different countries over a period of more than 25 years. An analyst may be interested in which factors drive a strong increase in that indicator. Thus, she first needs to determine the trend in the data. Then, additional background knowledge about the countries is gathered from the Web of Linked Data, which helps her identify relevant factors that may explain a high or low increase in scientific publications. Such factors are obtained, e.g., by running a correlation analysis, and the significant correlations can be visualized for further analysis and for determining outliers from the trend.
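The two analysis steps described above, determining a trend per country and correlating it with a background attribute, can be sketched in plain Python. This is only an illustration under stated assumptions: the countries, publication counts, and background attribute values are invented toy data, and the least-squares slope and Pearson coefficient are one straightforward choice of trend and correlation measure, not necessarily the ones used in the paper.

```python
from statistics import mean


def slope(xs, ys):
    """Least-squares slope of ys over xs, used here as the trend."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    covariance = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return covariance / (var_a * var_b) ** 0.5


# Hypothetical toy data: yearly publication counts per country and one
# background attribute per country (e.g. gathered from a linked dataset).
years = [2000, 2001, 2002, 2003]
publications = {
    "A": [10, 20, 30, 40],
    "B": [5, 6, 7, 8],
    "C": [50, 70, 90, 110],
}
background = {"A": 2.0, "B": 0.5, "C": 4.0}

countries = sorted(publications)
trends = [slope(years, publications[c]) for c in countries]
attribute = [background[c] for c in countries]
r = pearson(trends, attribute)
```

A value of `r` close to 1 or -1 would flag the attribute as a candidate factor worth inspecting further; it does not by itself establish a causal relationship.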

The rest of this paper is structured as follows. Section 2 describes the functionality of the RapidMiner Linked Open Data extension. In Section 3, we show the example use case of scientific publications in detail, whereas Section 4 briefly showcases other use cases for which the extension has been employed in the past. Section 5 presents evaluations of various aspects of the extension. Section 6 discusses related work, and Section 7 provides an outlook on future directions pursued with the extension.

Description

RapidMiner is a data mining platform in which data mining and analysis processes are designed from elementary building blocks, so-called operators. Each operator performs a specific action on data, e.g., loading and storing data, transforming data, or inferring a model on data. The user can compose a process from operators by placing them on a canvas and wiring their input and output ports, as shown in Fig. 1.
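The operator concept can be mimicked in a few lines of Python. The sketch below is a strong simplification and not RapidMiner's API: it assumes every operator has exactly one input and one output port, so that wiring reduces to a linear chain, and the names `chain`, `load_data`, `add_square`, and `keep_large` are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Operator = Callable[[List[Row]], List[Row]]


def chain(*operators: Operator) -> Operator:
    """Wire operators so that each output feeds the next input."""
    def process(rows: List[Row]) -> List[Row]:
        for operator in operators:
            rows = operator(rows)
        return rows
    return process


def load_data(_: List[Row]) -> List[Row]:
    # Loading operator: ignores its input and emits a toy example set.
    return [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]


def add_square(rows: List[Row]) -> List[Row]:
    # Transformation operator: adds a derived attribute to every row.
    return [{**row, "x_squared": row["x"] ** 2} for row in rows]


def keep_large(rows: List[Row]) -> List[Row]:
    # Filter operator: keeps rows whose derived attribute exceeds 1.
    return [row for row in rows if row["x_squared"] > 1.0]


process = chain(load_data, add_square, keep_large)
result = process([])
```

Real processes are graphs rather than chains, since operators may expose several ports, but the composition principle is the same.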

The RapidMiner Linked Open Data extension adds a set of operators to RapidMiner,

Example use case

In our example use case, we use an RDF data cube with World Bank economic indicators data as a starting point. The data cube contains time-indexed data for more than 1000 indicators in over 200 countries. As shown in Fig. 1, the process starts with importing data from that data cube (1). To that end, a wizard is used, which lets the user select the indicator(s) of interest. The complete data cube import wizard is shown in Fig. 3. In our example, in the first step we

Further use cases

The RapidMiner Linked Open Data extension has already been used in several prior use cases.

In [24], we used DBpedia and the RDF Book Mashup dataset to generate eight different feature sets for building a book recommender system. We built a hybrid, multi-strategy approach that combines the results of different base recommenders and generic recommenders into a final recommendation. The complete system was implemented within RapidMiner, combining the LOD extension and the Recommendation Extension.
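Combining the ranked outputs of several base recommenders into one final list can be done in many ways. The sketch below uses a Borda count purely as an illustrative assumption; the text above does not state which combination method [24] uses, and the book identifiers and the function name `borda` are hypothetical.

```python
from collections import defaultdict


def borda(rankings):
    """Merge ranked lists: an item at position p in a list of length n
    earns n - p points; items are returned by descending total score,
    with ties broken alphabetically."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical outputs of three base recommenders, best item first.
rankings = [
    ["book_a", "book_b", "book_c"],
    ["book_b", "book_a", "book_c"],
    ["book_b", "book_c", "book_a"],
]
final = borda(rankings)
```

Here `book_b` collects the most points because two of the three toy recommenders rank it first.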

Evaluation

Many of the algorithms implemented by the RapidMiner LOD extension have been evaluated in different settings. In this section, we point out the key evaluations for the most important features of the extension, and introduce additional evaluations of specific aspects of the extension.

Related work

The use of Linked Open Data in data mining has been proposed before, and implementations as RapidMiner extensions as well as proprietary toolkits exist.

The direct predecessor of the RapidMiner LOD extension is the FeGeLOD toolkit [41], a data preprocessing toolkit based on the Weka platform [42], which contains basic versions of some of the operators offered by the LOD extension.

Different means to mine data in Linked Open Datasets have been proposed, e.g., an extension for RapidMiner [43],

Conclusion and outlook

In this paper, we have introduced the RapidMiner Linked Open Data extension. It provides a set of operators for augmenting existing datasets with additional attributes from open data sources, which often leads to better predictive and descriptive models. The RapidMiner Linked Open Data extension provides operators that allow for adding such attributes in an automatic, unsupervised manner.

Several directions of research are currently being pursued to improve the extension.

Acknowledgment

The work presented in this paper has been partly funded by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD).

References (56)

I. Horrocks et al., From SHIQ and RDF to OWL: The making of a web ontology language, Web Semant. Sci. Serv. Agents World Wide Web (2003)
A. Lausch et al., Data mining and linked open data - new perspectives for data analysis in environmental research, Ecol. Model. (2015)
C. Bizer et al., Linked data - the story so far, Int. J. Semant. Web Inf. Syst. (2009)
M. Schmachtenberg, C. Bizer, H. Paulheim, Adoption of the linked data best practices in different topical domains, in:...
U.M. Fayyad et al., Advances in Knowledge Discovery and Data Mining (1996)
P. Cabena et al., Discovering Data Mining: From Concept to Implementation (1998)
M. Hofmann et al., RapidMiner: Data Mining Use Cases and Business Analytics Applications (2013)
B. Kämpgen et al., OLAP4LD - a framework for building analysis applications over governmental statistics
C. Bizer et al., The RDF book mashup: from web APIs to a web of data
J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S....
P.N. Mendes, M. Jakob, A. García-Silva, C. Bizer, DBpedia Spotlight: shedding light on the web of documents, in:...
N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (2000)
G.K.D. de Vries, S. de Rooij, A fast and simple graph kernel for RDF, in: DMLOD,...
N. Shervashidze et al., Weisfeiler-Lehman graph kernels, J. Mach. Learn. Res. (2011)
G.K.D. de Vries, A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data, in: ECML/PKDD (1),...
P. Ristoski et al., A comparison of propositionalization strategies for creating features from linked open data
Y. Jeong, S.-H. Myaeng, Feature selection using a semantic hierarchy for event recognition and type classification, in:...
P. Ristoski et al., Feature selection in hierarchical feature spaces
B.B. Wang, R.I.B. Mckay, H.A. Abbass, M. Barlow, A comparative study for domain ontology guided feature extraction, in:...
H. Paulheim et al., Discoverability of SPARQL endpoints in linked open data
F.M. Suchanek et al., PARIS: probabilistic alignment of relations, instances, and schema, PVLDB (2011)
J. Bleiholder, F. Naumann, Data fusion, ACM...
M. Bussmann, A. Kleine, The analysis of European research networks: Cooperation in the seventh framework program, in:...
J.D. Sachs et al., The geography of poverty and wealth, Sci. Am. (2001)
P. Ristoski, E.L. Mencía, H. Paulheim, A hybrid multi-strategy recommender system using linked open...
H. Paulheim, Identifying wrong links between datasets by multi-dimensional outlier detection, in: Workshop on Debugging...
M. Goldstein, Anomaly detection, in: RapidMiner: Data Mining Use Cases and Business Analytics Applications,...
O. De Clercq, S. Hertling, V. Hoste, S.P. Ponzetto, H. Paulheim, Identifying disputed topics in the news, in: Linked...
