Building a Linked Open Data Cloud of Linguistic Resources: Motivations and Developments

Chiarcos, Christian; Moran, Steven; Mendes, Pablo N.; Nordhoff, Sebastian; Littauer, Richard

doi:10.1007/978-3-642-35085-6_12

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

We describe ongoing community efforts to create a Linked Open Data (sub-)cloud of linguistic resources, with an emphasis on resources that are specific to linguistic research, namely annotated corpora and linguistic databases. We argue that, for both types of resources, applying the Linked Open Data paradigm and representing the data in RDF is a promising approach to addressing interoperability problems and to integrating information from different repositories. This is illustrated with example studies for different kinds of linguistic resources.

The efforts described in this chapter are conducted in the context of the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation. The OWLG is a network of researchers interested in linguistic resources and/or their publication under open licenses, and a number of its members are engaged in applying the Linked Open Data paradigm to their resources. Under the umbrella of the OWLG, these efforts will eventually result in the creation of a Linguistic Linked Open Data cloud (LLOD).

Notes

1.
The term ‘resource’ is ambiguous here. As understood in this chapter, resources are structured collections of data which can be represented, for example, using RDF. In RDF, however, ‘resource’ is the conventional name of a node in the graph, because, historically, these nodes were meant to represent objects that are described by metadata. Hence, we use the terms ‘node’ or ‘concept’ whenever RDF resources are meant.
2.
Federation is possible with SPARQL, although it does not necessarily perform well with state-of-the-art implementations. A more efficient alternative to federation is thus to retrieve the content necessary for a particular application from another end point and to query it locally. SPARQL end points provide this functionality, and publishing data under open licenses (see below) guarantees that the necessary legal preconditions for this practice are met.
3.
http://www.w3.org/DesignIssues/LinkedData.html, paragraph ‘Is your Linked Open Data 5 Star?’
4.
The application of RDF to linguistic resources as described here has occasionally been suggested (see [11, 13] for linguistic corpora), but these approaches focused on the RDF representation of individual resources rather than on linking them with other types of linguistic resources. By contrast, the focus of this chapter is not on modeling linguistic resources, but rather on the potential of linking them with each other.
5.
http://www.anc.org/MASC
6.
http://dbpedia.org/
7.
http://live.dbpedia.org/resource/Byzantine_Empire
8.
Note that [68] have also defined a TF-ICF measure (where “C” stands for “corpus”) with the objective to generate vectors of streaming documents in linear time. In DBpedia Spotlight’s TF*ICF (where “C” stands for “candidate”) the objective is to give more weight to words that are rare among confusable entities.
9.
http://spotlight.dbpedia.org/demo
10.
http://purl.org/vocabularies/princeton/wn30/synset-Byzantine-adjective-2
11.
http://glottolog.livingsources.org
12.
http://bibliontology.com/
13.
http://language-archives.org
14.
http://www.ethnologue.com
15.
http://multitree.org/
16.
http://wals.info
17.
http://phoible.org
18.
http://lingweb.eva.mpg.de/ids/
19.
http://languagelink.let.uu.nl/tds
20.
https://github.com/SebastianNordhoff/LingTyp.owl
21.
Compiled by Eugene Chang, http://lingweb.eva.mpg.de/numeral/
22.
http://ontowiki.net
23.
http://linguistics-ontology.org
24.
http://linguistlist.org/
25.
http://www.isocat.org
26.
http://purl.org/olia/penn-syntax.owl
27.
http://purl.org/olia/penn-syntax-link.rdf
28.
http://www.lexvo.org
29.
http://www.lingvoj.org
30.
http://linguistics.okfn.org
31.
http://okfn.org/
32.
http://opendefinition.org
33.
For example, complex copyright situations may arise if one resource (say, a lexicon) was developed on the basis of another resource (say, a newspaper archive), and researchers are uncertain whether the examples from the original newspaper contained in the lexicon violate the original copyright. Ethical problems may arise if a data base of quotations from a newspaper is linked to a data base of speakers, and this data base is further connected with, say, obituaries from the same newspaper. Even if this was done only in order to study generation-specific language variation, one may wonder whether such an accumulation of information violates the privacy of the people involved.
34.
http://lod-cloud.net
35.
http://framenet.icsi.berkeley.edu
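
The idea behind the TF*ICF weighting mentioned in Note 8 can be illustrated with a small sketch. It is not the DBpedia Spotlight implementation: the function names, the toy data, and the exact formula (term frequency multiplied by the logarithm of the number of candidates divided by the number of candidates whose context contains the word) are assumptions that follow the usual inverse-frequency pattern. The sketch only shows why a word shared by all confusable candidates receives no weight, while a word that is rare among them receives a high weight.

```python
import math
from collections import Counter


def icf(word, candidate_contexts):
    """Inverse candidate frequency (assumed form): log(N / n(word)),
    where N is the number of candidates and n(word) is the number of
    candidates whose context contains the word."""
    n_with_word = sum(1 for ctx in candidate_contexts.values() if word in ctx)
    if n_with_word == 0:
        return 0.0  # word unseen for every candidate: no evidence either way
    return math.log(len(candidate_contexts) / n_with_word)


def tf_icf_vector(entity, candidate_contexts):
    """Weight each context word of one candidate by TF * ICF."""
    tf = Counter(candidate_contexts[entity])
    return {w: count * icf(w, candidate_contexts) for w, count in tf.items()}


# Hypothetical toy data: context words observed with three candidate
# entities for the ambiguous surface form "Washington".
candidates = {
    "George_Washington": ["president", "general", "american", "american"],
    "Washington,_D.C.": ["capital", "city", "american"],
    "Washington_(state)": ["state", "pacific", "american"],
}

weights = tf_icf_vector("George_Washington", candidates)
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

In this toy example, "american" occurs with every candidate and therefore does not help to tell them apart, whereas "president" occurs with only one candidate and is weighted accordingly.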

References

Abney S, Bird S (2010) The Human Language Project: Building a universal corpus of the world’s languages. In: 48th Annual Meeting of the Association for Computational Linguistics (ACL-2010), Uppsala, Sweden, pp 88–97
Baker C, Fellbaum C (2009) WordNet and FrameNet as complementary resources for annotation. In: 3rd Linguistic Annotation Workshop (LAW-2009), Suntec, Singapore, pp 125–129
Bakker D, Dahl O, Haspelmath M, Koptjevskaja-Tamm M, Lehmann C, Siewierska A (1993) EUROTYP guidelines. Technical report, European Science Foundation Programme in Language Typology
Baker C, Fillmore C, Lowe J (1998) The Berkeley FrameNet project. In: 36th Annual Meeting of the Association for Computational Linguistics (ACL-1998), Montréal, Canada, pp 86–90
Bender E (2008) Evaluating a crosslinguistic grammar resource: A case study of Wambaya. In: 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-2008: HLT), Columbus, Ohio, pp 977–985
Berners-Lee T (2006) Design issues: Linked data. http://www.w3.org/DesignIssues/LinkedData.html. Accessed 31 July 2012
Bies A, Ferguson M, Katz K, MacIntyre R (1995) Bracketing guidelines for Treebank II style Penn Treebank project. ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/root.ps.gz. Accessed 31 July 2012, version of January 1995
Bird S, Liberman M (2001) A formal framework for linguistic annotation. Speech Commun 33(1-2):23–60
Brandes U, Eiglsperger M, Herman I, Himsolt M, Marshall M (2002) GraphML progress report: Structural layer proposal. In: 9th International Symposium on Graph Drawing (GD-2001), Vienna, Austria, pp 501–512
Brown C, Holman E, Wichmann S, Velupillai V (2008) Automated classification of the world’s languages. STUF Lang Typol Univers 61(4):286–308
Burchardt A, Padó S, Spohr D, Frank A, Heid U (2008) Formalising multi-layer corpora in OWL/DL – Lexicon modelling, querying and consistency control. In: 3rd International Joint Conference on NLP (IJCNLP-2008), Hyderabad, India
Carletta J, Evert S, Heid U et al (2003) The NITE XML toolkit: Flexible annotation for multi-modal language data. Behav Res Methods Instrum Comput 35(3):353–363
Cassidy S (2010) An RDF realisation of LAF in the DADA annotation server. In: 5th Joint ISO-ACL/SIGSEM Workshop on Interoperable Semantic Annotation (ISA-5), Hong Kong, China
Chiarcos C (2012) A generic formalism to represent linguistic corpora in RDF and OWL/DL. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 3205–3212
Chiarcos C (2012) Ontologies of linguistic annotation: Survey and perspectives. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 303–310
Chiarcos C (2012) POWLA: Modeling linguistic corpora in OWL/DL. In: 9th Extended Semantic Web Conference (ESWC-2012), Heraklion, Crete, pp 225–239
Chiarcos C, Dipper S, Götze M, Leser U, Lüdeling A, Ritz J, Stede M (2008) A flexible framework for integrating annotations from different tools and tagsets. TAL (Traitement automatique des langues) 49(2):217–246
Chiarcos C, Hellmann S, Nordhoff S, Moran S, Littauer R, Eckle-Kohler J, Gurevych I, Hartmann S, Matuschek M, Meyer C (2012) The Open Linguistics Working Group. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 3603–3610
Chiarcos C, Nordhoff S, Hellmann S (eds) (2012) Linked data in linguistics. Representing and connecting language data and language metadata. Springer, Heidelberg
Chiarcos C, McCrae J, Cimiano P, Fellbaum C (2012) Towards open data for linguistics: Linguistic linked data. In: Oltramari A, Lu-Qin, Vossen P, Hovy E (eds) New Trends of Research in Ontologies and Lexical Resources. Springer, Heidelberg
Corbett G (2005) Number of genders. In: Haspelmath M, Dryer M, Gil D, Comrie B (eds) The World Atlas of Language Structures. Oxford University Press, Oxford
Declerck T (2006) SynAF: Towards a standard for syntactic annotation. In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 229–232
Dimitriadis A, Everaert M, Reinhart T, Reuland E (2005) Anaphora typology database. http://languagelink.let.uu.nl/anatyp. Accessed 31 July 2012
Dostert L (1955) The Georgetown-IBM experiment. In: Locke W, Booth A (eds) Machine translation of languages. Wiley, New York, pp 124–135
Dryer M (1997) On the six-way word order typology. Stud Lang 21(1):69–103
Eckart R (2008) Choosing an XML database for linguistically annotated corpora. Sprache und Datenverarbeitung 32(1):7–22
Farrar S, Langendoen DT (2003) A linguistic ontology for the Semantic Web. GLOT Int 7:97–100
Farrar S, Langendoen DT (2010) An OWL-DL implementation of GOLD: An ontology for the Semantic Web. In: Witt A, Metzing D (eds) Linguistic modeling of information and markup languages, Springer, Dordrecht
Francis WN, Kucera H (1964) Brown Corpus manual, revised edition. Technical report, Brown University, Providence, Rhode Island, 1979
Francopoulo G, George M, Calzolari N, Monachini M, Bel N, Pet M, Soria C (2006) Lexical Markup Framework (LMF). In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 233–236
Gangemi A, Navigli R, Velardi P (2003) The OntoWordNet project: Extension and axiomatization of conceptual relations in WordNet. In: Meersman R, Tari Z (eds) Proceedings of On the Move to Meaningful Internet Systems (OTM-2003), Catania, Italy, pp 820–838
Good J, Hendryx-Parker C (2006) Modeling contested categorization in linguistic databases. In: EMELD Workshop on Digital Language Documentation, East Lansing, MI
Greenberg J (1960) A quantitative approach to the morphological typology of languages. Int J Am Linguist 26:178–194
Haspelmath M, Tadmor U (eds) (2009) World Loanword Database. Max Planck Digital Library, Munich
Haspelmath M, Dryer M, Gil D, Comrie B (eds) (2008) The World Atlas of Language Structures online. Max Planck Digital Library, Munich
Hellmann S, Unbehauen J, Chiarcos C, Ngonga Ngomo AC (2010) The TIGER Corpus Navigator. In: 9th International Workshop on Treebanks and Linguistic Theories (TLT-9), Tartu, Estonia, pp 91–102
Hellmann S, Lehmann J, Auer S (2012) Linked-data aware URI schemes for referencing text fragments. In: 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW-2012), Galway, Ireland
Hellmann S, Stadler C, Lehmann J (2012) The German DBpedia: A sense repository for linking entities. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics, Springer, Heidelberg, pp 181–190
Hwa R, Resnik P, Weinberg A, Cabezas C, Kolak O (2005) Bootstrapping parsers via syntactic projection across parallel texts. Nat Lang Eng 11(3):311–325
Ide N, Pustejovsky J (2010) What does interoperability mean, anyway? Toward an operational definition of interoperability. In: 2nd International Conference on Global Interoperability for Language Resources (ICGL-2010), Hong Kong, China
Ide N, Romary L (2004) International standard for a linguistic annotation framework. Nat Lang Eng 10(3-4):211–225
Ide N, Romary L (2004) A registry of standard data categories for linguistic annotation. In: 4th International Conference on Language Resources and Evaluation (LREC-2004), Lisbon, Portugal, pp 135–139
Ide N, Romary L (2006) Representing linguistic corpora and their annotations. In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 225–228
Ide N, Suderman K (2007) GrAF: A graph-based format for linguistic annotations. In: 1st Linguistic Annotation Workshop (LAW-2007), Prague, Czech Republic, pp 1–8
Ide N, Baker CF, Fellbaum C, Fillmore CJ, Passonneau R (2008) MASC: The Manually Annotated Sub-Corpus of American English. In: 6th International Conference on Language Resources and Evaluation (LREC-2008), Marrakech, Morocco, pp 2455–2461
Kemps-Snijders M, Windhouwer M, Wittenburg P, Wright S (2008) ISOcat: Corralling data categories in the wild. In: 6th International Conference on Language Resources and Evaluation (LREC-2008), Marrakech, Morocco, pp 887–891
Kilgarriff A, Grefenstette G (2003) Introduction to the special issue on the Web as Corpus. Comput Linguist 29(3):333–347
Leech G, Wilson A (1996) EAGLES recommendations for the morphosyntactic annotation of corpora. http://www.ilc.cnr.it/EAGLES/annotate/annotate.html. Accessed 31 July 2012
Lehmann J, Bizer C, Kobilarov G et al (2009) DBpedia – A crystallization point for the Web of Data. J Web Semant 7(3):154–165
Lewis W (2010) Haitian Creole: How to build and ship an MT engine from scratch in 4 days, 17 hours, & 30 minutes. In: 14th Annual Conference of the European Association for Machine Translation (EAMT-2010), Saint-Raphaël, France
Lux M, Laußmann J, Mehler A, Menßen C (2011) An online platform for visualizing lexical networks. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2011), Lyons, France, pp 495–496
Maddieson I (1984) Patterns of Sounds. Cambridge University Press, Cambridge/New York
McClanahan P, Busby G, Haertel R, Heal K, Lonsdale D, Seppi K, Ringger E (2010) A probabilistic morphological analyzer for Syriac. In: 14th Conference on Empirical Methods on Natural Language Processing (EMNLP-2010), Cambridge, MA, pp 810–820
McCrae J, Spohr D, Cimiano P (2011) Linking lexical resources and ontologies on the semantic web with Lemon. In: 8th Extended Semantic Web Conference (ESWC-2011), Heraklion, Crete. Springer, pp 245–259
Mendes P, Jakob M, García-Silva A, Bizer C (2011) DBpedia Spotlight: Shedding light on the Web of Documents. In: 7th International Conference on Semantic Systems (I-Semantics 2011), Graz, Austria
Mendes P, Daiber J, Rajapakse R et al (2012) Evaluating the impact of phrase recognition on concept tagging. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey
Mendes P, Jakob M, Bizer C (2012) DBpedia for NLP: A multilingual cross-domain knowledge base. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey
Meyers A, Ide N, Denoyer L, Shinyama Y (2007) The shared corpora working group report. In: 1st Linguistic Annotation Workshop (LAW-2007), Prague, Czech Republic, pp 184–190
Michaelis S, Maurer P, Haspelmath M, Huber M (eds) (to appear 2013) Atlas of Pidgin and Creole Language Structures. Oxford University Press, Oxford
Moran S (2012) Phonetics information base and lexicon. PhD thesis, University of Washington
Moran S (2012) Using linked data to create a typological knowledge base. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics. Springer, Heidelberg, pp 129–138
Moran S, McCloy D, Wright R (2012) Revisiting the population vs phoneme inventory correlation. In: 86th Annual Meeting of the Linguistic Society of America (LSA-2012), Portland, OR
Morris W (ed) (1969) The American Heritage dictionary of the English language. Houghton Mifflin, New York
Nordhoff S, Hammarström H (2011) Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In: 1st International Workshop on Linked Science (LISC-2011), Bonn, Germany
Pedersen T (2008) Empiricism is not a matter of faith. Comput Linguist 34(3):465–470
Ponzetto S, Navigli R (2009) Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In: 21st International Joint Conference on Artificial Intelligence (IJCAI-2009), Pasadena, CA, pp 2083–2088
Prud’Hommeaux E, Seaborne A (2008) SPARQL query language for RDF. W3C Recommendation. http://www.w3.org/TR/rdf-sparql-query. Accessed 20 Dec 2013
Reed JW, Jiao Y, Potok TE, Klump BA, Elmore MT, Hurson AR (2006) TF-ICF: A new term weighting scheme for clustering dynamic data streams. In: 5th International Conference on Machine Learning and Applications (ICMLA-2006), Washington, DC, pp 258–263
Romary L, Zeldes A, Zipser F (2011) [Tiger2/] – serialising the ISO SynAF syntactic object model. arXiv preprint arXiv:1108.0631
Schneider R (2007) A database-driven ontology for German grammar. In: Rehm G, Witt A, Lemnitzer L (eds) Data structures for linguistic resources and applications, Narr, Tübingen, pp 305–314
Schuurman I, Windhouwer M (2011) Explicit semantics for enriched documents. What do ISOcat, RELcat and SCHEMAcat have to offer?. In: 2nd Supporting Digital Humanities Conference, Copenhagen, Denmark
Su L, Sung L, Huang S, Hsieh F, Lin Z (2008) NTU corpus of Formosan languages: A state-of-the-art report. Corpus Linguist Linguist Theory 4(2):291–294
Telljohann H, Hinrichs E, Kübler S, Zinsmeister H, Beck K (2003) Stylebook for the Tübingen treebank of written German (TüBa-D/Z). Technical report, Seminar für Sprachwissenschaft, Universität Tübingen, Germany
Tramp S, Frischmuth P, Arndt N, Ermilov T, Auer S (2011) Weaving a distributed, semantic social network for mobile users. In: 8th Extended Semantic Web Conference (ESWC-2011), Heraklion, Crete, pp 200–214
Tyers F, Wiechetek L, Trosterud T (2009) Developing prototypes for machine translation between two Sámi languages. In: 13th Annual Conference of the European Association for Machine Translation (EAMT-2009), Barcelona, Spain, pp 120–127
Vatant B, Wick M (2012) GeoNames ontology. http://www.geonames.org/ontology. Accessed 31 July 2012, version 3.01
Weibel S, Kunze J, Lagoze C, Wolf M (1998) RFC 2413 – Dublin core metadata for resource discovery. http://www.ietf.org/rfc/rfc2413.txt. Accessed 31 July 2012, Network Working Group
Windhouwer M, Wright S (2012) Linking to linguistic data categories in ISOcat. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics, Springer, Heidelberg, pp 99–107

Acknowledgements

In parts, this chapter is based on a number of earlier conference presentations, including [14, 18] and [38]. We would like to thank the contributors to these papers: Jonas Brekle, Philipp Cimiano, Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Sebastian Hellmann, Jens Lehmann, Michael Matuschek, John McCrae, Christian M. Meyer, and Claus Stadler. We would also like to thank all other OWLG members, as well as the participants of LDL-2012. Further, we would like to express our gratitude towards the anonymous reviewers for feedback and comments. The research of the first author was partially supported by a DAAD postdoctoral fellowship at the Information Sciences Institute of the University of Southern California.

Author information

Authors and Affiliations

Goethe University Frankfurt am Main, Frankfurt am Main, Germany
Christian Chiarcos
Ludwig-Maximilians-Universität, München, Germany
Steven Moran
Freie Universität Berlin, Berlin, Germany
Pablo N. Mendes
Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Sebastian Nordhoff
Universität des Saarlandes, Saarbrücken, Germany
Richard Littauer

Corresponding author

Correspondence to Christian Chiarcos.

Editor information

Editors and Affiliations

Department of Computer Science Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt, Darmstadt, Germany
Iryna Gurevych & Jungi Kim

About this chapter

Cite this chapter

Chiarcos, C., Moran, S., Mendes, P.N., Nordhoff, S., Littauer, R. (2013). Building a Linked Open Data Cloud of Linguistic Resources: Motivations and Developments. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_12

DOI: https://doi.org/10.1007/978-3-642-35085-6_12
Published: 21 February 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35084-9
Online ISBN: 978-3-642-35085-6
eBook Packages: Computer Science, Computer Science (R0)
