Mining the Web of Linked Data with RapidMiner
Introduction
The Web of Linked Data is a collection of machine-processable, interlinked datasets from various domains, ranging from general cross-domain knowledge sources to government, library, and media data. Today it comprises roughly a thousand datasets [1], [2]. While many domain-specific applications use Linked Open Data, general-purpose applications rarely go beyond displaying the data and provide little means of deriving additional knowledge from it.
At the same time, sophisticated data mining platforms exist that support the user in finding patterns in data, providing meaningful visualizations, etc. What is missing is a bridge between the vast amount of data on the one hand and intelligent data analysis tools on the other. Given a data analysis problem, a data analyst should be able to automatically find suitable data from different relevant data sources, which is then combined, cleansed, and served to the user for further analysis. This data collection, preparation, and fusion process is an essential part of the data analysis workflow [3]. However, it is also one of the most time-consuming parts, constituting roughly half of the costs in data analytics projects [4]. Furthermore, because the step is time-consuming, a data analyst most often makes a heuristic selection of data sources based on a priori assumptions, and the resulting analysis is therefore subject to selection bias. Despite these issues, automation of this stage of data processing is still rarely achieved.
In this paper, we discuss how the Web of Linked Data can be mined using the full functionality of the state-of-the-art data mining environment RapidMiner [5]. We introduce an extension to RapidMiner that bridges the gap between the Web of Data and data mining, and that can be used for carrying out sophisticated analysis tasks on and with Linked Open Data. The extension provides means to automatically connect local data to background knowledge from Linked Open Data, or to load data from a desired Linked Open Data source into the RapidMiner platform, which itself provides more than 400 operators for analyzing data, including classification, clustering, and association analysis.
RapidMiner is a programming-free data analysis platform, which allows the user to design data analysis processes in a plug-and-play fashion by wiring operators. Furthermore, functionality can be added to RapidMiner by developing extensions, which are made available on the RapidMiner Marketplace. The RapidMiner Linked Open Data extension adds operators for loading data from datasets within Linked Open Data, as well as for autonomously following RDF links to other datasets and gathering additional data from there. The extension also supports schema matching for data gathered from different datasets.
As the operators from that extension can be combined with all RapidMiner built-in operators, as well as those from other extensions (e.g., for time series analysis), complex data analysis processes on Linked Open Data can be built. Such processes can automatically combine and integrate data from different datasets and support the user in making sense of the integrated data.
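The link-following idea described above can be illustrated with a minimal sketch. It assumes a toy in-memory triple list instead of live Linked Open Data sources; the prefixes, property names, and values are made up for illustration, and the function gather_attributes is hypothetical rather than part of the extension.

```python
# Toy triples from three hypothetical datasets (prefixes a:, b:, c: are invented).
SAME_AS = "owl:sameAs"

TRIPLES = [
    ("a:DE", "a:publications", 100),      # local dataset
    ("a:DE", SAME_AS, "b:Germany"),       # link to a second dataset
    ("b:Germany", "b:population", 80),    # attribute in the second dataset
    ("b:Germany", SAME_AS, "c:country1"), # link to a third dataset
    ("c:country1", "c:code", "DE"),       # attribute in the third dataset
]

def gather_attributes(start, triples, max_hops=2):
    """Collect attributes of `start` and of entities reachable via sameAs links."""
    features = {}
    frontier = [start]
    seen = {start}
    # Iteration k handles entities that are k sameAs hops away from `start`.
    for _ in range(max_hops + 1):
        next_frontier = []
        for entity in frontier:
            for subject, predicate, obj in triples:
                if subject != entity:
                    continue
                if predicate == SAME_AS:
                    if obj not in seen:
                        seen.add(obj)
                        next_frontier.append(obj)
                else:
                    features[predicate] = obj
        frontier = next_frontier
    return features

local_only = gather_attributes("a:DE", TRIPLES, max_hops=0)
enriched = gather_attributes("a:DE", TRIPLES, max_hops=2)
```

With max_hops=0 only the local attribute is returned, while two hops add the attributes found in the two linked datasets. A real setting would additionally need schema matching, since different datasets may describe the same property under different names.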
The use case we pursue in this paper starts from a Linked Open Dataset publishing various World Bank indicators. Among many others, this dataset captures the number of scientific journal publications in different countries over a period of more than 25 years. An analyst may be interested in which factors drive a high increase in that indicator. Thus, she first needs to determine the trend in the data. Then, additional background knowledge about the countries is gathered from the Web of Linked Data, which helps her identify relevant factors that may explain a high or low increase in scientific publications. Such factors are obtained, e.g., by running a correlation analysis, and the significant correlations can be visualized for further analysis and for determining outliers from the trend.
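The two analytical steps of this use case, determining a trend and correlating it with background attributes, can be sketched as follows. This is a minimal illustration under stated assumptions: the trend is taken to be a least-squares slope, the correlation is Pearson's coefficient, and all country names, values, and the background attribute are invented toy data rather than World Bank figures.

```python
from math import sqrt

def slope(years, values):
    """Least-squares slope of values over years, used here as the trend."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Invented toy data: publication counts per hypothetical country and year.
years = [2000, 2001, 2002, 2003]
publications = {
    "A": [10, 20, 30, 40],
    "B": [5, 6, 7, 8],
    "C": [100, 100, 100, 100],
}
# Invented background attribute, standing in for data gathered from other datasets.
background = {"A": 3.0, "B": 1.2, "C": 1.0}

trends = {country: slope(years, values) for country, values in publications.items()}
countries = sorted(trends)
r = pearson([trends[c] for c in countries], [background[c] for c in countries])
```

In this toy setting, the attribute correlates strongly with the trend, which is the kind of signal an analyst would then inspect further, for example by plotting it and looking for outliers.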
The rest of this paper is structured as follows. Section 2 describes the functionality of the RapidMiner Linked Open Data extension. In Section 3, we show the example use case of scientific publications in detail, whereas Section 4 briefly showcases other use cases for which the extension has been employed in the past. Section 5 presents evaluations of various aspects of the extension. Section 6 discusses related work, and Section 7 provides an outlook on future directions pursued with the extension.
Section snippets
Description
RapidMiner is a data mining platform in which data mining and analysis processes are designed from elementary building blocks, so-called operators. Each operator performs a specific action on data, e.g., loading and storing data, transforming data, or inferring a model on data. The user can compose a process from operators by placing them on a canvas and wiring their input and output ports, as shown in Fig. 1.
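The operator concept can be mimicked in a few lines of code. The sketch below is not RapidMiner's API (RapidMiner processes are designed graphically); it merely assumes that an operator can be modeled as a function from a table to a table and that wiring ports corresponds to chaining such functions. All names are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]
Operator = Callable[[Table], Table]

def select_attributes(names: List[str]) -> Operator:
    """Build an operator that keeps only the given attributes."""
    def apply(table: Table) -> Table:
        return [{name: row[name] for name in names} for row in table]
    return apply

def filter_rows(predicate: Callable[[Row], bool]) -> Operator:
    """Build an operator that keeps only rows satisfying the predicate."""
    def apply(table: Table) -> Table:
        return [row for row in table if predicate(row)]
    return apply

def run_process(table: Table, operators: List[Operator]) -> Table:
    """Feed the output of each operator into the next one, like wired ports."""
    for operator in operators:
        table = operator(table)
    return table

# Invented toy data.
data = [
    {"country": "A", "year": 2000, "value": 10},
    {"country": "B", "year": 2000, "value": 5},
    {"country": "A", "year": 2001, "value": 20},
]
process = [
    filter_rows(lambda row: row["country"] == "A"),
    select_attributes(["year", "value"]),
]
result = run_process(data, process)
```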
The RapidMiner Linked Open Data extension adds a set of operators to RapidMiner,
Example use case
In our example use case, we use an RDF data cube with World Bank economic indicators data as a starting point. The data cube contains time-indexed data for more than 1000 indicators in over 200 countries. As shown in Fig. 1, the process starts with importing data from that data cube (1). To that end, a wizard is used, which lets the user select the indicator(s) of interest. The complete data cube import wizard is shown in Fig. 3. In our example, in the first step we
Further use cases
The RapidMiner Linked Open Data extension has already been used in several prior use cases.
In [24], we use DBpedia and the RDF Book Mashup dataset to generate eight different feature sets for building a book recommender system. We built a hybrid, multi-strategy approach that combines the results of different base recommenders and generic recommenders into a final recommendation. The complete system was implemented within RapidMiner, combining the LOD extension and the Recommendation Extension.
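To illustrate what combining the results of several base recommenders can look like, the following sketch aggregates ranked lists with a Borda count. This is an illustrative choice only: the text above does not state which combination method was used in [24], and the item names and rankings are invented.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Merge ranked lists: an item at position p in a list of n items gets n - p points."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Invented outputs of three hypothetical base recommenders.
rankings = [
    ["book_a", "book_b", "book_c"],
    ["book_b", "book_a", "book_c"],
    ["book_b", "book_c", "book_a"],
]
final = borda_aggregate(rankings)
```

Rank-based aggregation has the practical advantage that the base recommenders do not need to produce scores on a common scale.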
Evaluation
Many of the algorithms implemented by the RapidMiner LOD extension have been evaluated in different settings. In this section, we point out the key evaluations for the most important features of the extension, and introduce additional evaluations of specific aspects of the extension.
Related work
The use of Linked Open Data in data mining has been proposed before, and implementations as RapidMiner extensions as well as proprietary toolkits exist.
The direct predecessor of the RapidMiner LOD extension is the FeGeLOD toolkit [41], a data preprocessing toolkit based on the Weka platform [42], which contains basic versions of some of the operators offered by the LOD extension.
Different means to mine data in Linked Open Datasets have been proposed, e.g., an extension for RapidMiner [43],
Conclusion and outlook
In this paper, we have introduced the RapidMiner Linked Open Data extension. It provides a set of operators for augmenting existing datasets with additional attributes from open data sources, which often leads to better predictive and descriptive models. The RapidMiner Linked Open Data extension provides operators that allow for adding such attributes in an automatic, unsupervised manner.
There are different directions of research that are currently pursued in order to improve the extension.
Acknowledgment
The work presented in this paper has been partly funded by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD).
References (56)
- et al., From SHIQ and RDF to OWL: The making of a web ontology language, Web Semant. Sci. Serv. Agents World Wide Web (2003)
- et al., Data mining and linked open data - new perspectives for data analysis in environmental research, Ecol. Model. (2015)
- et al., Linked data - the story so far, Int. J. Semant. Web Inf. Syst. (2009)
- M. Schmachtenberg, C. Bizer, H. Paulheim, Adoption of the linked data best practices in different topical domains, in:...
- et al., Advances in Knowledge Discovery and Data Mining (1996)
- et al., Discovering Data Mining: From Concept to Implementation (1998)
- et al., RapidMiner: Data Mining Use Cases and Business Analytics Applications (2013)
- et al., Olap4ld - a framework for building analysis applications over governmental statistics
- et al., The RDF book mashup: from web APIs to a web of data
- J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S....
- An Introduction to Support Vector Machines and Other Kernel-based Learning Methods
- Weisfeiler-Lehman graph kernels, J. Mach. Learn. Res.
- A comparison of propositionalization strategies for creating features from linked open data
- Feature selection in hierarchical feature spaces
- Discoverability of SPARQL endpoints in linked open data
- PARIS: probabilistic alignment of relations, instances, and schema, PVLDB
- The geography of poverty and wealth, Sci. Am.