
1 Introduction

The advent of Linked Data [1] accelerates the evolution of the Web into an exponentially growing information space in which the unprecedented volume of data offers information consumers a level of information integration that has not been possible until now. Consumers can now mash up and readily integrate information for a myriad of alternative end uses.

In recent years, governmental organizations have started to publish their data as open data (most typically as CSV files). To fully exploit the potential of such data, the publication process should be improved so that the data are published as Linked Open Data. By leveraging open data to Linked Data, we increase the usefulness of the data: we provide global identifiers for things and we enrich the data with links to external sources.

To leverage CSV files to Linked DataFootnote 1, it is necessary to (1) classify CSV columns based on their content and context against existing knowledge bases, (2) assign RDF terms (HTTP URLs, blank nodes, and literals) to the particular cell values according to the Linked Data principles (HTTP URL identifiers may be reused from one of the existing knowledge bases), (3) discover relations between columns based on the evidence for these relations in the existing knowledge bases, and (4) convert the CSV data to RDF data properly, using data types, language tags, well-known Linked Data vocabularies, etc.

As an illustrative example of leveraging CSV files to Linked Data, suppose that a published CSV file contains names of movies in the first column and names of the directors of these movies in the second column. Leveraging this CSV file to Linked Data should then automatically (1) classify the first and second columns as containing instances of the classes ‘Movie’ and ‘Director’, respectively, (2) convert the cell values in the ‘movies’ and ‘directors’ columns to HTTP URL resources, e.g., instead of using the string ‘Matrix’ as the name of the movie, the URL ‘http://www.freebase.com/m/02116f’ may be used, pointing to the Freebase knowledge baseFootnote 2 and standing for the ‘Matrix’ movie, with further attributes of that movie and links to further resources, and (3) discover relations between the columns, such as the relation ‘isDirectedBy’ between the first and second columnFootnote 3.
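To sketch what the result of steps (1)–(3) could look like, the following minimal Python example (using the rdflib library) builds the corresponding RDF triples. The vocabulary namespace, the property name, and the director URL are illustrative assumptions; only the Freebase movie URL comes from the example above.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

g = Graph()
EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary

# (2) cell values become HTTP URL resources instead of plain strings
matrix = URIRef("http://www.freebase.com/m/02116f")       # 'Matrix' movie
director = URIRef("http://example.org/person/Wachowski")  # illustrative

# (1) columns classified as containing instances of 'Movie' and 'Director'
g.add((matrix, RDF.type, EX.Movie))
g.add((director, RDF.type, EX.Director))

# (3) discovered relation between the first and second column
g.add((matrix, EX.isDirectedBy, director))

# (4) literals keep data types / language tags where no entity fits
g.add((matrix, EX.name, Literal("Matrix", lang="en")))

print(g.serialize(format="turtle"))
```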

In this paper, we focus on the CSV files available at two Austrian data portals – http://www.data.gv.at and http://www.opendataportal.at. The former is the official national Austrian data portal, hosting many datasets published by the Austrian government. Our goal is not to find a solution which fully automatically leverages tabular data to Linked Data – we are aware that this is a really challenging task – but to help data wranglers convert tabular data to Linked Data by suggesting to them (1) concepts classifying the columns and (2) entities to which the cell values may be disambiguated. To realize these steps, we evaluate TableMiner+, an algorithm for (semi)automatic leveraging of tabular data to Linked Data. By successfully classifying columns and disambiguating cell values, we immediately increase the quality of the data along the interlinking quality dimension [9].

The main contributions of this paper are the lessons learned from evaluating TableMiner+ on the tasks of classifying columns and disambiguating cell values in CSV files obtained from the national Austrian open data portal. TableMiner+ is also evaluated in [10]; nevertheless, that evaluation (1) does not run TableMiner+ on CSV files and (2) does not run it on governmental open data, which contains, e.g., lots of statistical data.

The rest of the paper is organized as follows. Section 2 discusses possible approaches for leveraging tabular data to Linked Data and justifies the selection of TableMiner+ as the most promising algorithm for leveraging CSV files to Linked Data. Section 3 evaluates the TableMiner+ algorithm on the data obtained from the national Austrian data portal. Section 4 summarizes the lessons learned, and we conclude in Sect. 5.

2 TableMiner+ and Related Work

TableMiner+ [10] is an algorithm for (semi)automatic leveraging of tabular data to Linked Data. TableMiner+ consumes a table as its input. It then (1) discovers the subject column of the table (the ‘primary key’ column containing identifiers for the rows), (2) classifies the columns of the table to concepts (topics) available in Freebase, (3) links (disambiguates) cell values against Linked Data entities in Freebase, and (4) discovers relations among the columns by trying to find evidence for these relations in Freebase. TableMiner+ uses Freebase as its knowledge base; as the authors of [10] claim, Freebase is currently the largest knowledge base and Linked Data set in the world, containing over 2.4 billion facts about over 43 million topics (e.g., entities, concepts), significantly exceeding other popular knowledge bases such as DBpediaFootnote 4 and YAGO [7]. TableMiner+ is available under an open license – Apache License v2.0.

Limaye et al. [3] model table components (e.g., column headers, cells) and their interdependence using a probabilistic graphical model, which consists of two parts: variables that model the different table components, and factors modeling (1) the compatibility between a variable and each of its candidates and (2) the compatibility between variables believed to be correlated. For example, given a named entity column, the header of the column is a variable that takes values from a set of candidate concepts; each cell in the column is a variable that takes values from a set of candidate entities. The inference task then amounts to searching for an assignment of values to the variables that maximizes the joint probability [10].

Mulwad et al. [5] argue that computing the joint probability distribution in Limaye’s method [3] is very expensive. Building on the earlier work by Syed et al. [8] and Mulwad et al. [4, 6], they propose a lightweight semantic message passing algorithm that applies inference to the same kind of graphical model.

Comparing the approaches of Limaye et al. and Mulwad et al. with TableMiner+, as the authors of [10] state, the TableMiner+ approach is fundamentally different since it (1) adds a subject column detection algorithm, (2) deals with both named entity columns and literal columns, while Mulwad et al. only handle named entity columns, (3) uses an efficient approach bootstrapped by data sampled from the table, while both Mulwad et al. and Limaye et al. build models that approach the task in an exhaustive, inefficient way, (4) uses different methods for scoring and ranking candidate entities, concepts, and relations, and (5) models interdependence differently, which, if transformed to an equivalent graphical model, would result in fewer factor nodes.

In [2], the authors present an approach for enabling user-driven semantic mapping of large amounts of tabular data using the MediaWikiFootnote 5 system. Although we agree that user feedback is important when judging the correctness of a suggested concept for classification or a suggested entity for disambiguation, and that completely automated solutions for leveraging tabular data to Linked Data are very challenging, the approach in [2] relies solely on user-driven mappings, which expects too much effort from the users.

OpenRefineFootnote 6 with the RDF extensionFootnote 7 provides a service to disambiguate cell values to Linked Data entities, e.g., from DBpedia. Nevertheless, the disambiguation is not interconnected with the classification as it is in, e.g., the TableMiner+ approach introduced in [10]; so either the user has to manually specify the concept (class) restricting the candidate entities for disambiguation, or all entities are considered during disambiguation, which is inefficient. Furthermore, the disambiguation phase is based solely on the comparison of labels, without taking into account the context of the cell – the other cell values in the row, the other values in the column, the column header, etc.

We decided to use TableMiner+ to leverage CSV data from the national Austrian data portal to Linked Data because it outperforms similar algorithms, such as the one proposed by Mulwad et al. [5] or the algorithm presented in [3], and because it is available under an open license.

3 Evaluation

In this section, we describe the evaluation of the TableMiner+ algorithm on CSV files obtained from the national Austrian data portal available at http://data.gv.at. First we provide basic statistics about the data we use in the evaluation, and then we describe the evaluation metrics and the results obtained when evaluating (1) subject column detection, (2) classification, and (3) disambiguation. We do not evaluate the process of discovering binary relations among the columns of the input files in this paper.

Since the standard distribution of the TableMiner+ algorithmFootnote 8 expects HTML tables as its input, we extended the algorithm so that it also supports CSV files as inputFootnote 9.

3.1 Data and Basic Statistics

We evaluated TableMiner+ on 753 out of 1491 CSV files (50.5 %) obtained from the national Austrian data portal http://data.gv.at. The processed files were randomly selected from the files smaller than 1 MB that had correct, non-empty headers for all columns. We processed at most the first 1000 rows of every such file. The processed files had on average 8.46 columns and 1.47 named entity columns.
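The selection procedure can be sketched roughly as follows (a minimal Python sketch; the directory layout, the file encoding, and the exact header-validity check are our assumptions):

```python
import csv
import os

MAX_SIZE = 1 * 1024 * 1024   # consider only files smaller than 1 MB
MAX_ROWS = 1000              # process at most the first 1000 rows

def has_valid_header(path):
    """The header row must exist and contain no empty column names."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), None)
    return bool(header) and all(col.strip() for col in header)

def eligible_files(directory):
    """Yield the CSV files satisfying the size and header criteria."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if (name.endswith(".csv")
                and os.path.getsize(path) < MAX_SIZE
                and has_valid_header(path)):
            yield path

def read_capped(path):
    """Return the header and at most the first MAX_ROWS data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(MAX_ROWS), reader)]
    return header, rows
```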

3.2 Subject Column Detection

From all the processed files, we selected those for which the TableMiner+ algorithm identified more than one named entity column, and for those we evaluated the precision of subject column detection by comparing the subject column selected by the algorithm for the given file with the subject column manually annotated as correct by a human annotatorFootnote 10.

Results. In 97.15 % of the cases, the subject column was properly identified by the TableMiner+ algorithm. There were a couple of issues, e.g., in a CSV file containing a list of projects, the column with companies rather than the column with projects was considered the subject column. In the case of statistical data containing several dimensions and measures, every dimension (except the time dimension) was counted as a correctly identified subject column.

3.3 Classification

In the TableMiner+ algorithm, the candidate concepts classifying a certain column are computed in phases. First, a sample of cells of the processed column is selected and disambiguated, and the concepts of the disambiguated entities vote for the initial winning concept classifying the column. Then all cells within the column are disambiguated, taking into account the restrictions given by the initial winning concept, and afterwards all disambiguated cells vote once again for the concept classifying the column. If the winning concept changes, disambiguation and voting are iterated. Next, the candidate concepts for the given column are reexamined in the context of the other columns and their candidate concepts, which may once again change the winning concept suggested for the column. At the end, the TableMiner+ algorithm reports the winning concept for every named entity column, together with further candidate concepts and their scores (the winning concept has the highest score).
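The following self-contained Python sketch illustrates the core of this sample–disambiguate–vote loop. The knowledge-base lookup is mocked with a tiny dictionary, the entity IDs are invented, and the scoring is reduced to simple vote counting; the real algorithm is considerably more elaborate.

```python
from collections import Counter

# Mock knowledge-base lookup: cell value -> [(entity, concept), ...].
# The entity IDs and the candidate lists are purely illustrative.
MOCK_KB = {
    "Wien":     [("fb:m/city_a", "location/citytown")],
    "Graz":     [("fb:m/city_b", "location/citytown")],
    "Grambach": [("fb:m/river_x", "geography/river"),
                 ("fb:m/city_c",  "location/citytown")],
}

def disambiguate(cell, preferred=None):
    """Pick the top candidate, preferring the currently winning concept."""
    candidates = MOCK_KB.get(cell, [])
    if preferred is not None:
        candidates = sorted(candidates, key=lambda c: c[1] != preferred)
    return candidates[0] if candidates else None

def vote(cells, preferred=None):
    """Each disambiguated cell votes for the concept of its entity."""
    votes = Counter()
    for cell in cells:
        hit = disambiguate(cell, preferred)
        if hit:
            votes[hit[1]] += 1
    return votes

def classify_column(cells, sample_size=2):
    # Phase 1: a sample of cells votes for the initial winning concept.
    sample_votes = vote(cells[:sample_size])
    winning = sample_votes.most_common(1)[0][0] if sample_votes else None
    # Phase 2: disambiguate all cells under that concept and re-vote,
    # iterating while the winning concept keeps changing.
    for _ in range(10):  # guard against oscillation
        all_votes = vote(cells, winning)
        new_winning = all_votes.most_common(1)[0][0] if all_votes else None
        if new_winning == winning:
            break
        winning = new_winning
    return winning

print(classify_column(["Wien", "Graz", "Grambach"]))  # location/citytown
```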

To evaluate the precision of this classification, for each processed file and named entity column we recorded the candidate concepts for the classification together with the scores computed by the TableMiner+ algorithm, sorted by descending score. We then selected the candidate concepts having the 5 highest scores (since several candidate concepts may share the same score, this may include more than 5 candidate concepts). Afterwards, we selected a random sample of these candidate concepts (covering the selected candidate concepts for 100 columns) and asked annotators to annotate, for each file and named entity column, the classifications suggested by TableMiner+ – the annotators marked each suggested column classification with one of the labels best, good, or wrong. The label best means that the candidate concept is the best concept which may be used in the given situation – it must properly describe the semantics of the classified column and it must be as specific as possible, as the goal is to prefer the most specific concept among all suitable concepts; for example, instead of the concept location/location, the concept location/citytown is preferred for a column containing a list of Austrian cities. The label good means that the candidate concept is appropriate (it properly describes the semantics of the cell values in the column) but is not necessarily the most suitable concept. The label wrong means that the candidate concept is inappropriate – it has a different semantics.

Let us denote by \(\#Cols\) the number of columns annotated by the annotators. Further, let us define the function \(top_{N}(c)\), which is equal to 1 if the candidate concept c annotated as best for a certain column was also identified by TableMiner+ as a concept having at worst the N-th highest score, \(N \in \{1, 2, 3, 4, 5\}\), and 0 otherwise. If \(N=1\) and \(top_{1}(c) = 1\) for a certain concept c, the winning concept suggested by TableMiner+ is the same as the concept annotated as best by the annotators. Finally, let us define the metric \(best_{N}\) as the percentage of annotated named entity columns for which the candidate concept c annotated as best was also identified by TableMiner+ as a concept having at worst the N-th highest score:

$$\begin{aligned} best_{N} = 100\cdot \sum _{c} top_{N}(c)/\#Cols \end{aligned}$$

So, for example, \(best_{1}\) denotes the percentage of cases (columns) for which the concept annotated as best is also the winning concept suggested by TableMiner+.

The formula above does not penalize situations where several candidate concepts share the same score. Since our goal is not to automatically produce Linked Data or column classifications from the results of TableMiner+, but rather to present the user with a handful of candidate concepts that (s)he verifies and selects from, it is not important whether (s)he is presented with 5 or 8 concepts; what matters is to evaluate how often the concept annotated as best is among the highest-scored concepts.
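A minimal Python sketch of the \(best_{N}\) computation under this no-penalty-for-ties reading (the encoding of the annotation data is our own):

```python
def best_n(columns, n):
    """Compute the best_N measure.

    columns: list of (candidates, best) pairs, where candidates is a list
    of (concept, score) tuples for one column and best is the concept the
    annotators marked as 'best'. Concepts sharing a score share a rank.
    """
    hits = 0
    for candidates, best in columns:
        # the n highest *distinct* scores define the top-n ranks
        top_scores = sorted({s for _, s in candidates}, reverse=True)[:n]
        if any(concept == best and score in top_scores
               for concept, score in candidates):
            hits += 1
    return 100.0 * hits / len(columns)

# toy usage: two concepts tie for the top score, ties are not penalized
cols = [([("location/citytown", 2.1), ("location/location", 2.1),
          ("music/recording", 0.4)], "location/citytown")]
print(best_n(cols, 1))  # 100.0
```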

Results. Table 1 shows the winning concepts (Freebase topics) discovered by the TableMiner+ algorithm when run on all 753 files from the portal; it lists every concept suggested as the winning concept for at least 20 columns, together with the number of columns for which it was suggested.

Table 1. The winning concepts (Freebase topics) as discovered by TableMiner+

As we can see, the majority of the columns were classified with the Freebase concept location/location. Although this is correct in most cases, there is typically a better (more specific) concept available, such as location/citytown. There are also concepts, such as music/recording or film/film_character, which are in most cases the result of wrong classification due to low evidence for the correct concepts during the disambiguation of the sample cells.

Selected results of the \(best_{N}\) measure are presented in Table 2. As we can see, 20 % of the concepts annotated as best were suggested by the TableMiner+ algorithm as the winning concepts; 36 % of the concepts annotated as best for certain columns were among the concepts suggested by TableMiner+ with the highest or second highest score, etc. In other words, there is a 76 % probability that the concept annotated as best appears among the candidate concepts suggested by TableMiner+ with at worst the 5th highest score.

Furthermore, in 68 % of the analyzed columns, only concepts annotated as best or good appear among the concepts suggested by TableMiner+ with the 1st, 2nd, or 3rd highest score.

In 24 % of the analyzed columns, all suggested concept candidates were wrong. The reasons for such completely wrong classifications are typically two-fold: (1) low disambiguation recall due to low evidence for the cell values within the Freebase knowledge base, or (2) wrong disambiguation due to short named entities having various unintended meanings.

We did not evaluate the recall of the concept classification, as a concept classifying the column was always suggested, even though its precision could be low.

Table 2. Results of the \(best_{N}\) measure

3.4 Disambiguation

For selected concepts from Table 1, we computed the precision and recall of the entity disambiguation. Precision is calculated as the number of distinct entities (cell values) correctly linked to Freebase entities divided by the number of all distinct entities (cell values) linked to Freebase (restricted to the given concept). Recall is computed as the number of distinct entities linked to Freebase divided by the number of all distinct entities (restricted to the given concept). To determine which entities were correctly linked to Freebase, we again asked annotators to annotate, for the columns classified with the selected concepts, each distinct winning disambiguation of a cell value to a Freebase entity – the annotators marked each winning disambiguated entity as either correct or wrong. A disambiguation is correct if the disambiguated entity correctly represents the semantics of the cell value; otherwise, it is marked as wrong.
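Expressed as formulas, with E the set of distinct cell values restricted to the given concept, \(L \subseteq E\) those linked to Freebase, and \(C \subseteq L\) those links marked as correct by the annotators:

$$\begin{aligned} precision = \frac{|C|}{|L|}, \qquad recall = \frac{|L|}{|E|} \end{aligned}$$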

Results. In the case of the location/citytown concept, we analyzed the disambiguation of cities in 16 files where the concept location/citytown was suggested as the winning concept by the TableMiner+ algorithm. The precision of the disambiguation was 95.2 %, the recall 88.1 %. We also analyzed another 24 files containing a column with cities for which location/citytown was one of the suggested concepts classifying that column (but not the winning concept) with a score above 1.0. In this case, the precision was 99 % and the recall 99.8 %, taking into account more than 500 distinct disambiguated entities. It is also worth mentioning that the TableMiner+ algorithm properly disambiguates and classifies cell values based on the context of the cell; thus, in the column with cities, the cell value Grambach is properly classified as the city and not the river.

We analyzed 23 files containing a column with districts of Austria classified with the winning concept location/location. The precision was 38.3 % and the recall 100 %. The precision is lower because more than half of the districts (e.g., Leibnitz, Leoben) were classified as cities. The reason why these columns ended up classified with the rather generic concept location/location, and not with the more appropriate location/administrative_division, is that some values within the column were wrongly disambiguated to cities and voted for location/citytown, while others were disambiguated correctly to districts and voted for the best concept location/administrative_division; since both these types of entities also belong to the concept location/location, that concept was chosen as the winning one.

The concept base/aareas/schema/administrative_area has a high precision of 88 % and a recall of 100 %, but only 17 distinct districts of Linz were processed.

The concept organization/organization has reasonable precision for columns holding schools – it links faculties to the proper universities with a precision of 75 % and a recall of 81 %. For other types of organizations, such as pharmacies or hospitals, disambiguation does not work properly because there are no corresponding entities in Freebase to link to.

Disambiguation for the people/person concept has very low precision, because the vast majority of the persons in the data are not in the knowledge base. For the same reason, the precision for the concept business/employer is also very low.

4 Lessons Learned

There is a high correlation between the precision of disambiguation and the precision of classification, caused by the fact that the initial candidate concepts for the classification of a column are based on the votes of the entities disambiguated for the selected sample of cells.

If the recall of disambiguation is low (only few entities are disambiguated), it does not make sense to classify the column, as the classification will in most cases be misleading. In these cases, it is better to report that there is not enough evidence for classification, rather than to classify the column at any cost, because the latter ends up suggesting completely irrelevant concepts, which confuses users.
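In practice, this lesson translates into a simple guard such as the following sketch (the threshold value is an assumption to be tuned; TableMiner+ itself does not implement such a cut-off):

```python
MIN_DISAMBIGUATION_RECALL = 0.3   # assumed threshold, to be tuned

def guarded_classification(winning_concept, n_disambiguated, n_cells):
    """Suppress the classification when too few cells could be linked."""
    recall = n_disambiguated / n_cells if n_cells else 0.0
    if recall < MIN_DISAMBIGUATION_RECALL:
        return None   # i.e., report 'not enough evidence' to the user
    return winning_concept
```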

The row context used by the TableMiner+ algorithm proved its usefulness in many situations. For example, it allowed commonly named cities having more than one matching entity in Freebase to be properly disambiguated, i.e., the cities were properly disambiguated w.r.t. the countries to which they belong.

If the cell values to be disambiguated are too short (e.g., abbreviations) and the precision of the subject column disambiguation, which defines the context for these abbreviations, is low, it does not make sense to disambiguate these short cell values, as the precision of such disambiguation will be low.

Classification/disambiguation in TableMiner+ has higher precision when the processed tabular data has a subject column that is further described by the other columns, so that classification/disambiguation can use a rich row context. In the case of statistical data, which mostly consists of measures and dimension identifiers, the row context is not that beneficial and the precision of classification/disambiguation is lower.

In many cases, a generic knowledge base such as Freebase is not sufficient, as it does not include all the needed information, e.g., information about all the schools, hospitals, playgrounds, etc., in a country’s states/regions/cities. So apart from generic knowledge bases such as Freebase, more focused knowledge bases should also be used, or constructed upfront.

The TableMiner+ algorithm should exploit the hierarchies of concepts defined within the knowledge base, so that we can avoid situations where more generic concepts are chosen as winning concepts simply because more entities in the given column vote for them. Using the hierarchy of concepts would increase the precision and improve the performance of the classification/disambiguation algorithm.
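One possible realization is sketched below: votes for a specific concept also count as support for its ancestors, and among sufficiently supported concepts the most specific one wins. The parent map and the support-ratio threshold are hypothetical; applied to the districts example above, location/administrative_division would win over location/location.

```python
# Hypothetical concept hierarchy: child -> parent.
PARENT = {
    "location/citytown": "location/location",
    "location/administrative_division": "location/location",
}

def ancestors(concept):
    while concept in PARENT:
        concept = PARENT[concept]
        yield concept

def depth(concept):
    return sum(1 for _ in ancestors(concept))

def pick_winner(votes, ratio=0.5):
    """Pick the most specific concept whose propagated support reaches
    `ratio` of the best-supported concept's support.

    votes: dict mapping concept -> number of direct votes.
    """
    support = dict(votes)
    for concept, n in votes.items():
        for anc in ancestors(concept):        # propagate votes upwards
            support[anc] = support.get(anc, 0) + n
    threshold = ratio * max(support.values())
    eligible = [c for c, s in support.items() if s >= threshold]
    # prefer the deeper (more specific) concept, then the higher support
    return max(eligible, key=lambda c: (depth(c), support[c]))

# 6 cells wrongly voted 'citytown', 10 correctly voted the district concept
votes = {"location/citytown": 6, "location/administrative_division": 10}
print(pick_winner(votes))  # location/administrative_division
```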

4.1 Contributions to Data Quality

Paper [9] provides a survey of Linked Data quality assessment dimensions and metrics. In this section, we discuss how successful classification and disambiguation conducted by TableMiner+ contribute towards higher quality of the resulting Linked Data along the quality dimensions introduced in [9].

Successful classification and disambiguation increase the number of links to external (linked) data sources (such as general knowledge bases, e.g., DBpedia or Freebase) and thus directly increase the quality of the data along the interlinking dimension [9]. Having links to external (linked) data sources then makes it possible to improve the quality of the data along the quality assessment dimensions introduced in the following subsections.

Completeness. In [9], completeness is defined as “the degree to which all required information is present in a particular dataset”. Further, the authors distinguish:

  • Schema completeness: the degree to which the classes and properties of an ontology are represented

  • Property completeness: the degree to which values for a specific property are available for objects of a certain type or in general

  • Population completeness: the degree to which all real-world objects of a particular type are represented in the dataset

  • Interlinking completeness: the degree to which objects in the dataset are interlinked.

By disambiguating cell values to Linked Data entities, TableMiner+ increases interlinking completeness. By running the TableMiner+ algorithm, we may also increase property completeness by introducing more facts about the objects from other (linked) data sources, e.g., from DBpedia or Freebase.
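For example, once a cell value is linked to a DBpedia entity, missing property values can be pulled in from there; a minimal sketch using the SPARQLWrapper library (the entity URI is illustrative):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# query the public DBpedia endpoint for facts about a linked entity
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?p ?o WHERE {
        <http://dbpedia.org/resource/Graz> ?p ?o .
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["p"]["value"], binding["o"]["value"])
```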

Semantic Accuracy. In [9], semantic accuracy is defined as “the degree to which data values correctly represent the real world facts”.

Discrepancies in the data classified/disambiguated by TableMiner+ can be revealed by comparing the attributes of the disambiguated entities with the attributes of the corresponding data in external (linked) data sources, e.g., in DBpedia or Freebase.

Trustworthiness. In [9], trustworthiness is defined as “the degree to which the information is accepted to be correct, true, real and credible”.

The trustworthiness of the data processed by TableMiner+ can be increased by providing further evidence for the data from external (linked) data sources.

Interoperability. In [9], interoperability is defined as “the degree to which the format and structure of the information conforms to previously returned information as well as data from other sources”.

Paper [9] distinguishes two metrics for the interoperability dimension: (1) re-use of existing terms and (2) re-use of existing vocabularies, i.e., the extent to which relevant vocabularies are used.

By reusing existing identifiers from external (linked) data sources, we increase re-use of existing terms. Furthermore, by discovering relations in TableMiner+, we also contribute towards re-use of existing vocabularies.

5 Conclusions and Next Steps

We evaluated the TableMiner+ algorithm on Austrian open data obtained from the Austrian national open data portal available at http://www.data.gv.at.

We showed that in 76 % of the cases, the concept annotated by humans as best in the given situation appears among the candidate concepts suggested by TableMiner+ with at worst the 5th highest score. This is a promising result, as our main purpose is to provide data wranglers not only with the winning concepts but also with a certain number of alternative concepts.

Classification and disambiguation had very high precision for the concept of cities (95 %+) and reasonable precision for certain other concepts, such as districts, states, and organizations. Nevertheless, for certain columns/cell values, the precision of classification/disambiguation was rather low, caused either by (1) missing evidence for the disambiguated cell values in the Freebase knowledge base or (2) by trying to disambiguate cell values that have various alternative meanings. We showed that in 24 % of the cases, the analyzed columns had completely irrelevant classifications, which is rather confusing for users; in these cases, it would be better not to produce any classification at all.

Although the first results are promising, we plan to experiment further (1) with different knowledge bases, such as WikidataFootnote 11, and (2) with improvements of the TableMiner+ algorithm, so that it behaves, e.g., more conservatively in cases of low evidence for the classification/disambiguation.