Building a Linked Open Data Cloud of Linguistic Resources: Motivations and Developments

The People’s Web Meets NLP

Abstract

We describe ongoing community efforts to create a Linked Open Data (sub-)cloud of linguistic resources, with an emphasis on resources that are specific to linguistic research, namely annotated corpora and linguistic databases. We argue that, for both types of resources, applying the Linked Open Data paradigm and representing the data in RDF is a promising approach to addressing interoperability problems and to integrating information from different repositories. This is illustrated with example studies for different kinds of linguistic resources. The efforts described in this chapter are conducted in the context of the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation. The OWLG is a network of researchers interested in linguistic resources and/or their publication under open licenses, and a number of its members are engaged in applying the Linked Open Data paradigm to their resources. Under the umbrella of the OWLG, these efforts will eventually result in the creation of a Linguistic Linked Open Data cloud (LLOD).

Notes

  1. The term ‘resource’ is ambiguous here. As understood in this chapter, resources are structured collections of data which can be represented, for example, using RDF. In RDF, however, ‘resource’ is the conventional name of a node in the graph, because, historically, these nodes were meant to represent objects that are described by metadata. Hence, we use the terms ‘node’ or ‘concept’ whenever RDF resources are meant.

  2. Federation is possible with SPARQL, although state-of-the-art implementations do not necessarily perform well. A more efficient alternative to federation is therefore to retrieve the content needed for a particular application from another endpoint and to query it locally. SPARQL endpoints provide this functionality, and publishing data under open licenses (see below) guarantees that the necessary legal preconditions for this practice are met.

  3. http://www.w3.org/DesignIssues/LinkedData.html, paragraph ‘Is your Linked Open Data 5 Star?’

  4. The application of RDF to linguistic resources as described here has occasionally been suggested (see [11, 13] for linguistic corpora), but these approaches focused on the RDF representation of individual resources rather than on linking them with other types of linguistic resources. In contrast, the focus of this chapter is not on modeling linguistic resources but on the potential of linking them with each other.

  5. http://www.anc.org/MASC

  6. http://dbpedia.org/

  7. http://live.dbpedia.org/resource/Byzantine_Empire

  8. Note that [68] have also defined a TF-ICF measure (where “C” stands for “corpus”), with the objective of generating vectors of streaming documents in linear time. In DBpedia Spotlight’s TF*ICF (where “C” stands for “candidate”), the objective is to give more weight to words that are rare among confusable entities.

  9. http://spotlight.dbpedia.org/demo

  10. http://purl.org/vocabularies/princeton/wn30/synset-Byzantine-adjective-2

  11. http://glottolog.livingsources.org

  12. http://bibliontology.com/

  13. http://language-archives.org

  14. http://www.ethnologue.com

  15. http://multitree.org/

  16. http://wals.info

  17. http://phoible.org

  18. http://lingweb.eva.mpg.de/ids/

  19. http://languagelink.let.uu.nl/tds

  20. https://github.com/SebastianNordhoff/LingTyp.owl

  21. Compiled by Eugene Chang, http://lingweb.eva.mpg.de/numeral/

  22. http://ontowiki.net

  23. http://linguistics-ontology.org

  24. http://linguistlist.org/

  25. http://www.isocat.org

  26. http://purl.org/olia/penn-syntax.owl

  27. http://purl.org/olia/penn-syntax-link.rdf

  28. http://www.lexvo.org

  29. http://www.lingvoj.org

  30. http://linguistics.okfn.org

  31. http://okfn.org/

  32. http://opendefinition.org

  33. For example, complex copyright situations may arise if one resource (say, a lexicon) was developed on the basis of another resource (say, a newspaper archive), and researchers are uncertain whether the examples from the original newspaper contained in the lexicon violate the original copyright. Ethical problems may arise if a database of quotations from a newspaper is linked to a database of speakers, and this database is further connected with, say, obituaries from the same newspaper. Even if this was done only in order to study generation-specific language variation, one may wonder whether such an accumulation of information violates the privacy of the people involved.

  34. http://lod-cloud.net

  35. http://framenet.icsi.berkeley.edu
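
To make the candidate-based weighting mentioned in note 8 more concrete, the following is a minimal sketch of a TF*ICF-style weight. It assumes the usual inverse-frequency form, ICF(w) = log(N / n(w)), where N is the number of candidate entities for one ambiguous surface form and n(w) is the number of those candidates whose context contains the word w. The function name `tf_icf`, the toy contexts, and the exact formula are illustrative assumptions and are not taken from the DBpedia Spotlight implementation.

```python
import math
from collections import Counter


def tf_icf(candidate_contexts):
    """Return {entity: {word: weight}} with weight = TF * ICF.

    candidate_contexts maps each candidate entity of ONE ambiguous surface
    form to the list of words observed in that entity's context.
    Assumed formula (not confirmed by the chapter): ICF(w) = log(N / n(w)).
    """
    n_candidates = len(candidate_contexts)

    # n(w): number of candidate entities whose context contains w.
    candidate_freq = Counter()
    for words in candidate_contexts.values():
        candidate_freq.update(set(words))

    weights = {}
    for entity, words in candidate_contexts.items():
        term_freq = Counter(words)  # TF: raw count of w in this entity's context
        weights[entity] = {
            w: count * math.log(n_candidates / candidate_freq[w])
            for w, count in term_freq.items()
        }
    return weights


# Hypothetical toy contexts for two entities sharing the surface form "Byzantine".
contexts = {
    "Byzantine_Empire": ["empire", "constantinople", "history", "empire"],
    "Byzantine_fault_tolerance": ["distributed", "system", "history"],
}
weights = tf_icf(contexts)
```

In this toy setting, a word shared by all candidates ("history") receives weight zero, whereas a word that occurs with only one candidate ("empire") receives a positive weight, which is the behaviour the note describes: words that are rare among confusable entities count more.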

References

  1. Abney S, Bird S (2010) The Human Language Project: Building a universal corpus of the world’s languages. In: 48th Annual Meeting of the Association for Computational Linguistics (ACL-2010), Uppsala, Sweden, pp 88–97

  2. Baker C, Fellbaum C (2009) WordNet and FrameNet as complementary resources for annotation. In: 3rd Linguistic Annotation Workshop (LAW-2009), Suntec, Singapore, pp 125–129

  3. Bakker D, Dahl O, Haspelmath M, Koptjevskaja-Tamm M, Lehmann C, Siewierska A (1993) EUROTYP guidelines. Technical report, European Science Foundation Programme in Language Typology

  4. Baker C, Fillmore C, Lowe J (1998) The Berkeley FrameNet project. In: 36th Annual Meeting of the Association for Computational Linguistics (ACL-1998), Montréal, Canada, pp 86–90

  5. Bender E (2008) Evaluating a crosslinguistic grammar resource: A case study of Wambaya. In: 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-2008: HLT), Columbus, Ohio, pp 977–985

  6. Berners-Lee T (2006) Design issues: Linked data. http://www.w3.org/DesignIssues/LinkedData.html. Accessed 31 July 2012

  7. Bies A, Ferguson M, Katz K, MacIntyre R (1995) Bracketing guidelines for Treebank II style Penn Treebank project. ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/root.ps.gz. Accessed 31 July 2012, version of January 1995

  8. Bird S, Liberman M (2001) A formal framework for linguistic annotation. Speech Commun 33(1-2):23–60

  9. Brandes U, Eiglsperger M, Herman I, Himsolt M, Marshall M (2002) GraphML progress report: Structural layer proposal. In: 9th International Symposium on Graph Drawing (GD-2001), Vienna, Austria, pp 501–512

  10. Brown C, Holman E, Wichmann S, Velupillai V (2008) Automated classification of the world’s languages. STUF Lang Typol Univers 61(4):286–308

  11. Burchardt A, Padó S, Spohr D, Frank A, Heid U (2008) Formalising multi-layer corpora in OWL/DL – Lexicon modelling, querying and consistency control. In: 3rd International Joint Conference on NLP (IJCNLP-2008), Hyderabad, India

  12. Carletta J, Evert S, Heid U et al (2003) The NITE XML toolkit: Flexible annotation for multi-modal language data. Behav Res Methods Instrum Comput 35(3):353–363

  13. Cassidy S (2010) An RDF realisation of LAF in the DADA annotation server. In: 5th Joint ISO-ACL/SIGSEM Workshop on Interoperable Semantic Annotation (ISA-5), Hong Kong, China

  14. Chiarcos C (2012) A generic formalism to represent linguistic corpora in RDF and OWL/DL. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 3205–3212

  15. Chiarcos C (2012) Ontologies of linguistic annotation: Survey and perspectives. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 303–310

  16. Chiarcos C (2012) POWLA: Modeling linguistic corpora in OWL/DL. In: 9th Extended Semantic Web Conference (ESWC-2012), Heraklion, Crete, pp 225–239

  17. Chiarcos C, Dipper S, Götze M, Leser U, Lüdeling A, Ritz J, Stede M (2008) A flexible framework for integrating annotations from different tools and tagsets. TAL (Traitement automatique des langues) 49(2):217–246

  18. Chiarcos C, Hellmann S, Nordhoff S, Moran S, Littauer R, Eckle-Kohler J, Gurevych I, Hartmann S, Matuschek M, Meyer C (2012) The Open Linguistics Working Group. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 3603–3610

  19. Chiarcos C, Nordhoff S, Hellmann S (eds) (2012) Linked data in linguistics. Representing and connecting language data and language metadata. Springer, Heidelberg

  20. Chiarcos C, McCrae J, Cimiano P, Fellbaum C (2012) Towards open data for linguistics: Linguistic linked data. In: Oltramari A, Lu-Qin, Vossen P, Hovy E (eds) New Trends of Research in Ontologies and Lexical Resources. Springer, Heidelberg

  21. Corbett G (2005) Number of genders. In: Haspelmath M, Dryer M, Gil D, Comrie B (eds) The World Atlas of Language Structures. Oxford University Press, Oxford

  22. Declerck T (2006) SynAF: Towards a standard for syntactic annotation. In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 229–232

  23. Dimitriadis A, Everaert M, Reinhart T, Reuland E (2005) Anaphora typology database. http://languagelink.let.uu.nl/anatyp. Accessed 31 July 2012

  24. Dostert L (1955) The Georgetown-IBM experiment. In: Locke W, Booth A (eds) Machine translation of languages. Wiley, New York, pp 124–135

  25. Dryer M (1997) On the six-way word order typology. Stud Lang 21(1):69–103

  26. Eckart R (2008) Choosing an XML database for linguistically annotated corpora. Sprache und Datenverarbeitung 32(1):7–22

  27. Farrar S, Langendoen DT (2003) A linguistic ontology for the Semantic Web. GLOT Int 7:97–100

  28. Farrar S, Langendoen DT (2010) An OWL-DL implementation of GOLD: An ontology for the Semantic Web. In: Witt A, Metzing D (eds) Linguistic modeling of information and markup languages, Springer, Dordrecht

  29. Francis WN, Kucera H (1964) Brown Corpus manual, revised edition. Technical report, Brown University, Providence, Rhode Island, 1979

  30. Francopoulo G, George M, Calzolari N, Monachini M, Bel N, Pet M, Soria C (2006) Lexical Markup Framework (LMF). In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 233–236

  31. Gangemi A, Navigli R, Velardi P (2003) The OntoWordNet project: Extension and axiomatization of conceptual relations in WordNet. In: Meersman R, Tari Z (eds) Proceedings of On the Move to Meaningful Internet Systems (OTM-2003), Catania, Italy, pp 820–838

  32. Good J, Hendryx-Parker C (2006) Modeling contested categorization in linguistic databases. In: EMELD Workshop on Digital Language Documentation, East Lansing, MI

  33. Greenberg J (1960) A quantitative approach to the morphological typology of languages. Int J Am Linguist 26:178–194

  34. Haspelmath M, Tadmor U (eds) (2009) World Loanword Database. Max Planck Digital Library, Munich

  35. Haspelmath M, Dryer M, Gil D, Comrie B (eds) (2008) The World Atlas of Language Structures online. Max Planck Digital Library, Munich

  36. Hellmann S, Unbehauen J, Chiarcos C, Ngonga Ngomo AC (2010) The TIGER Corpus Navigator. In: 9th International Workshop on Treebanks and Linguistic Theories (TLT-9), Tartu, Estonia, pp 91–102

  37. Hellmann S, Lehmann J, Auer S (2012) Linked-data aware URI schemes for referencing text fragments. In: 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW-2012), Galway, Ireland

  38. Hellmann S, Stadler C, Lehmann J (2012) The German DBpedia: A sense repository for linking entities. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics, Springer, Heidelberg, pp 181–190

  39. Hwa R, Resnik P, Weinberg A, Cabezas C, Kolak O (2005) Bootstrapping parsers via syntactic projection across parallel texts. Nat Lang Eng 11(3):311–325

  40. Ide N, Pustejovsky J (2010) What does interoperability mean, anyway? Toward an operational definition of interoperability. In: 2nd International Conference on Global Interoperability for Language Resources (ICGL-2010), Hong Kong, China

  41. Ide N, Romary L (2004) International standard for a linguistic annotation framework. Nat Lang Eng 10(3-4):211–225

  42. Ide N, Romary L (2004) A registry of standard data categories for linguistic annotation. In: 4th International Conference on Language Resources and Evaluation (LREC-2004), Lisbon, Portugal, pp 135–139

  43. Ide N, Romary L (2006) Representing linguistic corpora and their annotations. In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 225–228

  44. Ide N, Suderman K (2007) GrAF: A graph-based format for linguistic annotations. In: 1st Linguistic Annotation Workshop (LAW-2007), Prague, Czech Republic, pp 1–8

  45. Ide N, Baker CF, Fellbaum C, Fillmore CJ, Passonneau R (2008) MASC: The Manually Annotated Sub-Corpus of American English. In: 6th International Conference on Language Resources and Evaluation (LREC-2008), Marrakech, Morocco, pp 2455–2461

  46. Kemps-Snijders M, Windhouwer M, Wittenburg P, Wright S (2008) ISOcat: Corralling data categories in the wild. In: 6th International Conference on Language Resources and Evaluation (LREC-2008), Marrakech, Morocco, pp 887–891

  47. Kilgarriff A, Grefenstette G (2003) Introduction to the special issue on the Web as Corpus. Comput Linguist 29(3):333–347

  48. Leech G, Wilson A (1996) EAGLES recommendations for the morphosyntactic annotation of corpora. http://www.ilc.cnr.it/EAGLES/annotate/annotate.html. Accessed 31 July 2012

  49. Lehmann J, Bizer C, Kobilarov G et al (2009) DBpedia – A crystallization point for the Web of Data. J Web Semant 7(3):154–165

  50. Lewis W (2010) Haitian Creole: How to build and ship an MT engine from scratch in 4 days, 17 hours, & 30 minutes. In: 14th Annual Conference of the European Association for Machine Translation (EAMT-2010), Saint-Raphaël, France

  51. Lux M, Laußmann J, Mehler A, Menßen C (2011) An online platform for visualizing lexical networks. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2011), Lyons, France, pp 495–496

  52. Maddieson I (1984) Patterns of Sounds. Cambridge University Press, Cambridge/New York

  53. McClanahan P, Busby G, Haertel R, Heal K, Lonsdale D, Seppi K, Ringger E (2010) A probabilistic morphological analyzer for Syriac. In: 14th Conference on Empirical Methods in Natural Language Processing (EMNLP-2010), Cambridge, MA, pp 810–820

  54. McCrae J, Spohr D, Cimiano P (2011) Linking lexical resources and ontologies on the semantic web with Lemon. In: 8th Extended Semantic Web Conference (ESWC-2011), Heraklion, Crete. Springer, pp 245–259

  55. Mendes P, Jakob M, García-Silva A, Bizer C (2011) DBpedia Spotlight: Shedding light on the Web of Documents. In: 7th International Conference on Semantic Systems (I-Semantics 2011), Graz, Austria

  56. Mendes P, Daiber J, Rajapakse R et al (2012) Evaluating the impact of phrase recognition on concept tagging. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey

  57. Mendes P, Jakob M, Bizer C (2012) DBpedia for NLP: A multilingual cross-domain knowledge base. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey

  58. Meyers A, Ide N, Denoyer L, Shinyama Y (2007) The shared corpora working group report. In: 1st Linguistic Annotation Workshop (LAW-2007), Prague, Czech Republic, pp 184–190

  59. Michaelis S, Maurer P, Haspelmath M, Huber M (eds) (to appear 2013) Atlas of Pidgin and Creole Language Structures. Oxford University Press, Oxford

  60. Moran S (2012) Phonetics information base and lexicon. PhD thesis, University of Washington

  61. Moran S (2012) Using linked data to create a typological knowledge base. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics. Springer, Heidelberg, pp 129–138

  62. Moran S, McCloy D, Wright R (2012) Revisiting the population vs phoneme inventory correlation. In: 86th Annual Meeting of the Linguistic Society of America (LSA-2012), Portland, OR

  63. Morris W (ed) (1969) The American Heritage dictionary of the English language. Houghton Mifflin, New York

  64. Nordhoff S, Hammarström H (2011) Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In: 1st International Workshop on Linked Science (LISC-2011), Bonn, Germany

  65. Pedersen T (2008) Empiricism is not a matter of faith. Comput Linguist 34(3):465–470

  66. Ponzetto S, Navigli R (2009) Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In: 21st International Joint Conference on Artificial Intelligence (IJCAI-2009), Pasadena, CA, pp 2083–2088

  67. Prud’Hommeaux E, Seaborne A (2008) SPARQL query language for RDF. W3C Recommendation. http://www.w3.org/TR/rdf-sparql-query, Accessed Dec, 20th, 2013

  68. Reed JW, Jiao Y, Potok TE, Klump BA, Elmore MT, Hurson AR (2006) TF-ICF: A new term weighting scheme for clustering dynamic data streams. In: 5th International Conference on Machine Learning and Applications (ICMLA-2006), Washington, DC, pp 258–263

  69. Romary L, Zeldes A, Zipser F (2011) [Tiger2/] – serialising the ISO SynAF syntactic object model. arXiv preprint arXiv:1108.0631

  70. Schneider R (2007) A database-driven ontology for German grammar. In: Rehm G, Witt A, Lemnitzer L (eds) Data structures for linguistic resources and applications, Narr, Tübingen, pp 305–314

  71. Schuurman I, Windhouwer M (2011) Explicit semantics for enriched documents. What do ISOcat, RELcat and SCHEMAcat have to offer? In: 2nd Supporting Digital Humanities Conference, Copenhagen, Denmark

  72. Su L, Sung L, Huang S, Hsieh F, Lin Z (2008) NTU corpus of Formosan languages: A state-of-the-art report. Corpus Linguist Linguist Theory 4(2):291–294

  73. Telljohann H, Hinrichs E, Kübler S, Zinsmeister H, Beck K (2003) Stylebook for the Tübingen treebank of written German (TüBa-D/Z). Technical report, Seminar für Sprachwissenschaft, Universität Tübingen, Germany

  74. Tramp S, Frischmuth P, Arndt N, Ermilov T, Auer S (2011) Weaving a distributed, semantic social network for mobile users. In: 8th Extended Semantic Web Conference (ESWC-2011), Heraklion, Crete, pp 200–214

  75. Tyers F, Wiechetek L, Trosterud T (2009) Developing prototypes for machine translation between two Sámi languages. In: 13th Annual Conference of the European Association for Machine Translation (EAMT-2009), Barcelona, Spain, pp 120–127

  76. Vatant B, Wick M (2012) GeoNames ontology. http://www.geonames.org/ontology. Accessed 31 July 2012, version 3.01

  77. Weibel S, Kunze J, Lagoze C, Wolf M (1998) RFC 2413 – Dublin core metadata for resource discovery. http://www.ietf.org/rfc/rfc2413.txt. Accessed 31 July 2012, Network Working Group

  78. Windhouwer M, Wright S (2012) Linking to linguistic data categories in ISOcat. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics, Springer, Heidelberg, pp 99–107

Acknowledgements

Parts of this chapter are based on a number of earlier conference presentations, including [14, 18] and [38]. We would like to thank the contributors to these papers: Jonas Brekle, Philipp Cimiano, Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Sebastian Hellmann, Jens Lehmann, Michael Matuschek, John McCrae, Christian M. Meyer, and Claus Stadler. We would also like to thank all other OWLG members, as well as the participants of LDL-2012. Further, we would like to express our gratitude to the anonymous reviewers for their feedback and comments. The research of the first author was partially supported by a DAAD postdoctoral fellowship at the Information Sciences Institute of the University of Southern California.

Author information

Corresponding author

Correspondence to Christian Chiarcos.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Chiarcos, C., Moran, S., Mendes, P.N., Nordhoff, S., Littauer, R. (2013). Building a Linked Open Data Cloud of Linguistic Resources: Motivations and Developments. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_12

  • DOI: https://doi.org/10.1007/978-3-642-35085-6_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35084-9

  • Online ISBN: 978-3-642-35085-6

  • eBook Packages: Computer Science, Computer Science (R0)
