Journal of Web Semantics

Volume 15, September 2012, Pages 51-61

Internationalization of Linked Data: The case of the Greek DBpedia edition

https://doi.org/10.1016/j.websem.2012.01.001

Abstract

This paper describes the deployment of the Greek DBpedia and the contribution to the DBpedia information extraction framework with regard to internationalization (I18n) and multilingual support. I18n filters are proposed as pluggable components to address issues that arise when extracting knowledge from non-English Wikipedia editions. We report on our strategy for supporting Internationalized Resource Identifiers (IRIs) and introduce two new extractors to complement the I18n filters. Additionally, the paper discusses the definition of Transparent Content Negotiation (TCN) rules for IRIs to address de-referencing and IRI serialization problems. The aim of this research is to establish best practices (complemented by software) that allow the DBpedia community to easily generate, maintain and properly interlink language-specific DBpedia editions. Furthermore, these best practices can be applied to the publication of Linked Data in non-Latin languages in general.

Introduction

The Semantic Web [1] is the evolution of the World Wide Web towards representing the meaning of information in a way that is processable by machines. Most recently, the Semantic Web vision was enriched by the concept of Linked Data (LD) [2], a movement which within a short time led to a vast amount of Linked Data on the Web, accessible in a simple yet standardized way. DBpedia [3] is one of the most prominent LD examples. It is an effort to extract knowledge from Wikipedia, represent it as RDF, and publish and interlink the extracted knowledge according to the Linked Data principles. DBpedia is currently the largest hub on the Web of Linked Data [4].

The early versions of the DBpedia Information Extraction Framework (DIEF) used the English Wikipedia as their sole source. Since the beginning, the focus of DBpedia has been to build a fused dataset by integrating information from many different Wikipedia editions. The emphasis of this fused DBpedia was still on the English Wikipedia, as it is the largest language edition. During the fusion process, however, language-specific information was lost or ignored. The aim of this research is to establish best practices (complemented by software) that allow the DBpedia community to easily generate, maintain and properly interlink language-specific DBpedia editions. We realized these best practices using the Greek Wikipedia as a basis and prototype, and contributed this work back to the original DIEF. We envisage that the Greek DBpedia will serve as a hub for an emerging Greek Linked Data (GLD) Cloud [5].

The Greek Wikipedia is still relatively small compared to other Wikipedia language editions: it ranks 66th in article count, with around 65,000 articles. Although the Greek Wikipedia is currently not as well organized as the English one (regarding infobox usage and other aspects), there is a strong support action by the Greek government, which foresees Wikipedia’s educational value, to promote article authoring in schools, universities and by everyday users. This action is thus quickly enriching the GLD cloud. In addition, the Greek government, following the initiative of open access to all public data, initiated the geodata project, which is publishing data from the public sector. The Greek DBpedia will not only become the core where all these datasets are interlinked, but will also provide guidelines on how they could be published, how non-Latin characters can be handled and how the Transparent Content Negotiation (TCN) rules (RFC 2295) [6] for de-referencing can be implemented.

After discussing the current status of DBpedia (in Section 2), we present the results of the development and implementation of a new internationalized DBpedia Information Extraction Framework (I18n-DIEF), which consists in particular of the following novelties:

  • i. Implementation of the DBpedia I18n Framework by plugging I18n filters into the original DBpedia framework (Section 3). This is now part of the official DBpedia Framework and it is used for the Greek and three additional localized DBpedia editions (cf. Section 2.3).
  • ii. Use of DBpedia as a statistical diagnostic tool for Wikipedia correctness (Sections 4.1 and 7).
  • iii. Development of the Template-Parameter Extractor to facilitate a semi-automatic mapping add-on tool and provide the basis for the infobox-to-ontology mapping statistics (Section 4.2).
  • iv. Justification of the need for language-specific namespaces (Section 5).
  • v. The linking of language-specific DBpedia editions to existing link targets of the English DBpedia in the LOD Cloud (Section 5.1).
  • vi. Definition of Transparent Content Negotiation rules for IRI dereferencing (Section 6.1).
  • vii. Identification of IRI serialization problems and the proposal for an effective solution (Section 6.2).
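
Contributions vi and vii both hinge on the difference between an IRI, which may contain non-Latin characters directly, and its percent-encoded URI form. As a minimal illustrative sketch (not the I18n-DIEF implementation), the following Python snippet converts between the two forms using only the standard library. The namespace `http://el.dbpedia.org/resource/`, the sample resource and the helper names `iri_to_uri` and `uri_to_iri` are assumptions chosen for illustration.

```python
from urllib.parse import quote, unquote

# Assumed namespace, used here for illustration only.
BASE = "http://el.dbpedia.org/resource/"

# Characters that carry URI syntax and must not be escaped; "%" is kept
# so that an already percent-encoded input is not encoded twice.
URI_SYNTAX_CHARS = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of non-ASCII characters.

    Simplification: internationalized host names are not handled here.
    """
    return quote(iri, safe=URI_SYNTAX_CHARS)


def uri_to_iri(uri: str) -> str:
    """Decode percent-encoded UTF-8 sequences back into characters."""
    return unquote(uri)


# The escapes below spell the Greek word for Athens (Alpha, theta,
# eta with tonos, nu, alpha).
iri = BASE + "\u0391\u03b8\u03ae\u03bd\u03b1"
uri = iri_to_uri(iri)

print(iri)  # human-readable IRI
print(uri)  # ASCII-only URI form
```

In this sketch the five Greek letters become ten percent-encoded bytes, so the URI form stays ASCII-only but is no longer readable by a Greek speaker, which may help to see why readability and serialization are treated as separate concerns.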

Section snippets

Background and problem description

In this section we introduce the state of DBpedia with regard to the extraction framework, the data topology as well as the internationalization support.

A solution overview: the I18n extension of the DBpedia Information Extraction Framework

Prior to this work, the main focus of the DIEF was on the English Wikipedia, and non-English languages were generally limited to information adhering to common patterns across different Wikipedia language editions (cf. Section 2.2). In this section, we describe our I18n extensions of the DIEF, which improve I18n support of the software and introduce customizability features for language-specific Wikipedia editions. The Greek language is well suited for an I18n case study as a language with non-Latin

Infobox mappings and properties

Among the richest sources of structured information in Wikipedia articles tapped by DBpedia are infobox templates. For this purpose two kinds of extractors exist, namely the Generic Infobox Extractor and the Mapping-Based Infobox Extractor. The former is straightforward and creates one triple for every infobox parameter in the form:

Language-specific design of DBpedia resource identifiers

Currently, the fused DBpedia extracts non-English Wikipedia articles only when they provide an English interlanguage link and the created resources use the default DBpedia namespace (cf. Section 2.2). Although this approach minimizes the use of non-Latin characters in resource identifiers, it has the following drawbacks:

  • 1. The merging is solely based on the link from the non-English resource to the English article. It has been shown that such links are more appropriate if the interlanguage links

Internationalized Resource Identifiers

Linked datasets are expected to provide machine processable as well as user readable and interpretable content (e.g. an HTML representation) [14]. However, the requirements for readability (see e.g. lexvo’s presentation of the term ‘door’ and the translation links), and for manual SPARQL query construction (cf. Listing 4) in non-Latin languages such as Greek cannot be satisfied using URIs. Therefore, the only option currently available is to use

Statistics and evaluation

In this section we examine the improvements achieved by the I18n revision of the DIEF. The results with regard to extracted triples for all available extractors (presented in Table 2) allow comparison between the Greek and the English DBpedia editions. Extractions of the Greek Wikipedia with the I18n-DIEF refer to the same Wikipedia dump as the Greek DBpedia v3.5.1. The final result of our efforts is presented in the column labeled I18n-aa (Greek DBpedia I18n-DIEF, all articles).

In the

Conclusions

With the maturing of Semantic Web technologies, proper support for internationalization is a crucial issue. This particularly involves the internationalization of resource identifiers, RDF serializations and corresponding tool support. The Greek DBpedia is the first step towards Linked Data internationalization and the first successful attempt to serve Linked Data with de-referenceable IRIs; it also serves as a guide for LOD publishing in non-Latin languages. Apart from the de-referenceable IRI

Acknowledgments

This project would not have been completed without the continuous support of the DBpedia team, the students and the staff of the Webscience M.Sc., Mr. Konstantino Stampouli, Greek Wikipedia administrator, and the Webscience M.Sc. program of Aristotle University of Thessaloniki that facilitated this effort. The administrative and financial support of the municipality of Veria is gratefully acknowledged. This work was also partially supported by a grant

References (22)

  • T. Berners-Lee et al., The semantic web, Scientific American (2001)
  • C. Bizer et al., Linked data - the story so far, International Journal on Semantic Web and Information Systems (2009)
  • J. Lehmann et al., DBpedia - a crystallization point for the web of data, Journal of Web Semantics (2009)
  • G. Kobilarov, C. Bizer, S. Auer, J. Lehmann, DBpedia - a linked data hub and data source for web applications and...
  • C. Bratsas, S. Alexiou, D. Kontokostas, I. Parapontis, I. Antoniou, G. Metakides, Greek open data in the age of linked...
  • K. Holtman, A. Mutz, Transparent Content Negotiation in HTTP, RFC 2295 (Experimental) (March 1998). URL...
  • S. Hellmann et al., DBpedia live extraction
  • E. Kyung Kim et al., Towards a Korean DBpedia and an approach for complementing the Korean Wikipedia based on DBpedia
  • S. Auer et al., I18n of semantic web applications
  • D. Vrandecic, Ontology evaluation, Ph.D. Thesis, KIT, Fakultät für Wirtschaftswissenschaften, Karlsruhe,...
  • M. Erdmann et al., Extraction of bilingual terminology from a multilingual web-based encyclopedia, Journal of Information Processing (2008)