Towards Open Data for Linguistics: Linguistic Linked Data

Abstract

‘Open Data’ has become very important in a wide range of fields. In linguistics, however, much data is still published in proprietary, closed formats and is not made available on the web. We propose the use of linked data principles to enable language resources to be published and interlinked openly on the web, and we describe the application of this paradigm to the modeling of two resources, WordNet and the MASC corpus. Here, WordNet and the MASC corpus serve as representative examples of two major classes of linguistic resources: lexical-semantic resources and annotated corpora, respectively. Furthermore, we argue that modeling and publishing language resources as linked data offers crucial advantages over existing formalisms. In particular, we explain how this can enhance the interoperability and the integration of linguistic resources. Further benefits of this approach include unambiguous identifiability of elements of linguistic description, the creation of dynamic but unambiguous links between different resources, the possibility to query across distributed resources, and the availability of a mature technological infrastructure. Finally, we describe recent community activities.

Notes

  1. The term ‘resource’ is ambiguous here. As understood in this chapter, resources are structured collections of data which can be represented, for example, in RDF. Hence, we prefer the terms ‘node’ or ‘concept’ whenever RDF resources are meant.

  2. We provide a SPARQL endpoint at http://monnetproject.deri.ie/lemonsource_query, which gives access to the examples discussed in this chapter.

  3. http://www.wordnet.princeton.edu

  4. http://www.tagmatica.fr/lmf/LMF_revision_14_In_OWL29october2007.xml

  5. http://en.wiktionary.org/

  6. www.anc.org/MASC

  7. Other domains where the linked data principles have been applied include, e.g., geography [20], biomedicine [1], cultural history (http://www.europeana.eu), and government data (e.g., http://data.gov and http://data.gov.uk).

  8. For example, the W3C Semantic Web Activity reported on developments for Media Resources, Data Provenance and Microdata in the first two weeks of February 2012.

  9. http://www4.wiwiss.fu-berlin.de/bizer/berlinsparqlbenchmark

  10. Examples include http://swoogle.umbc.edu, http://www.sindice.net, http://swse.deri.ie, and http://watson.kmi.open.ac.uk.

  11. http://linguistics.okfn.org

  12. http://wiki.okfn.org/Wg/linguistics

  13. http://linguistics.okfn.org/llod

  14. http://www.w3.org/community/ontolex

References

  1. Ashburner, M., Ball, C.A., et al.: Gene ontology: tool for the unification of biology. Nat. Genet. 25(1), 25–29 (2000)

  2. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL-1998), Montréal, pp. 86–90 (1998)

  3. Bird, S., Liberman, M.: A formal framework for linguistic annotation. Speech Commun. 33(1), 23–60 (2001)

  4. Bizer, C., Heath, T., Berners-Lee, T.: Linked data – the story so far. Int. J. Semant. Web Inf. Syst. (IJSWIS) 5(3), 1–22 (2009)

  5. Brandes, U., Eiglsperger, M., et al.: Graph markup language (GraphML). In: Tamassia, R. (ed.) Handbook of Graph Drawing and Visualization. Chapman & Hall/CRC, London (2010)

  6. Buil-Aranda, C., Arenas, M., Corcho, O.: Semantics and optimization of the SPARQL 1.1 federation extension. In: The Semantic Web: Research and Applications, pp. 1–15. Springer, Heraklion (2011)

  7. Carletta, J., Evert, S., et al.: The NITE XML Toolkit: data model and query. Lang. Resour. Eval. J. (LREJ) 39(4), 313–334 (2005)

  8. Cassidy, S.: An RDF realisation of LAF in the DADA annotation server. In: Proceedings of the 5th Joint ISO-ACL/SIGSEM Workshop on Interoperable Semantic Annotation (ISO-5), Hong Kong (2010)

  9. Chiarcos, C.: An ontology of linguistic annotations. LDV Forum 23(1), 1–16 (2008)

  10. Chiarcos, C.: Interoperability of corpora and annotations. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics, pp. 161–179. Springer, Heidelberg (2012)

  11. Chiarcos, C., Dipper, S., et al.: A flexible framework for integrating annotations from different tools and tagsets. TAL (Traitement automatique des langues) 49(2), 217–246 (2008)

  12. Chiarcos, C., Hellmann, S., et al.: The open linguistics working group. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul (2012a)

  13. Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.): Linked Data in Linguistics. Representing Language Data and Metadata. Springer, Heidelberg (2012b)

  14. Chiarcos, C., Ritz, J., Stede, M.: By all these lovely tokens … Merging conflicting tokenizations. J. Lang. Resour. Eval. (LREJ) 4(45), 53–74 (2012c)

  15. Dipper, S.: XML-based stand-off representation and exploitation of multi-level linguistic annotation. In: Eckstein, R., Tolksdorf, R. (eds.) Proceedings of Berliner XML Tage 2005 (BXML-2005), Berlin, pp. 39–50 (2005)

  16. Farrar, S., Langendoen, D.T.: An OWL-DL implementation of GOLD: an ontology for the Semantic Web. In: Witt, A., Metzing, D. (eds.) Linguistic Modeling of Information and Markup Languages. Springer, Dordrecht (2010)

  17. Fellbaum, C.: WordNet. MIT, Cambridge (1998)

  18. Fielding, R., Gettys, J., et al.: Hypertext transfer protocol – HTTP/1.1. Internet RFC 2068 (1997)

  19. Francopoulo, G., George, M., et al.: Lexical markup framework (LMF). In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa (2006)

  20. Goodwin, J., Dolbear, C., Hart, G.: Geographical linked data: the administrative geography of Great Britain on the Semantic Web. Trans. GIS 12, 19–30 (2008)

  21. Guéret, C., Kotoulas, S., Groth, P.: TripleCloud: an infrastructure for exploratory querying over web-scale RDF data. In: Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2011), Lyon, pp. 245–248 (2011)

  22. Gurevych, I., Eckle-Kohler, J., et al.: Uby – a large-scale unified lexical semantic resource based on LMF. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2012), Avignon, pp. 580–590 (2012)

  23. Hartig, O., Bizer, C., Freytag, J.C.: Executing SPARQL queries over the web of linked data. In: The Semantic Web – ISWC 2009, Heraklion, pp. 293–309 (2009)

  24. Holtman, K., Mutz, A.: Transparent content negotiation in HTTP. Internet RFC 2295 (1998)

  25. Ide, N., Pustejovsky, J.: What does interoperability mean, anyway? Toward an operational definition of interoperability. In: Proceedings of the 2nd International Conference on Global Interoperability for Language Resources (ICGL 2010), Hong Kong (2010)

  26. Ide, N., Suderman, K.: GrAF: A graph-based format for linguistic annotations. In: Proceedings of the First Linguistic Annotation Workshop (LAW 2007), Prague, pp. 1–8 (2007)

  27. Ide, N., Le Maitre, J., Véronis, J.: Outline of a model for lexical databases. In: Zampolli, A., Calzolari, N., Palmer, M.S. (eds.) Current Issues in Computational Linguistics: In Honour of Don Walker, Giardini, pp. 283–320 (1995)

  28. Ide, N., Fellbaum, C., et al.: The manually annotated sub-corpus: a community resource for and by the people. In: Proceedings of the ACL 2010 Conference Short Papers, Uppsala, pp. 68–73 (2010)

  29. Klyne, G., Carroll, J.J., McBride, B.: Resource description framework (RDF): concepts and abstract syntax. Technical report, W3C Recommendation (2004)

  30. Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1994)

  31. McCrae, J., Spohr, D., Cimiano, P.: Linking lexical resources and ontologies on the Semantic Web with Lemon. In: The Semantic Web: Research and Applications, Heraklion, pp. 245–259 (2011)

  32. McCrae, J., Montiel-Ponsoda, E., Cimiano, P.: Collaborative semantic editing of linked data lexica. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul (2012a)

  33. McCrae, J., Montiel-Ponsoda, E., Cimiano, P.: Integrating WordNet and Wiktionary with lemon. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics, pp. 25–34. Springer, Heidelberg (2012b)

  34. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)

  35. Prud’Hommeaux, E., Seaborne, A.: SPARQL query language for RDF. W3C working draft (2008)

  36. Quilitz, B., Leser, U.: Querying distributed RDF data sources with SPARQL. In: The Semantic Web: Research and Applications, pp. 524–538. Springer, Berlin/Heidelberg (2008)

  37. Schenk, S., Petrák, J.: Sesame RDF repository extensions for remote querying. In: Proceedings of the 7th Znalosti Conference (Znalosti-2008), Bratislava (2008)

  38. Shadbolt, N., Hall, W., Berners-Lee, T.: The semantic web revisited. IEEE Intell. Syst. 21(3), 96–101 (2006)

  39. Van Assem, M., Gangemi, A., Schreiber, G.: Conversion of WordNet to a standard RDF/OWL representation. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, pp. 237–242 (2006)

  40. Véronis, J., Ide, N.: A feature-based model for lexical databases. In: Proceedings of the 14th International Conference on Computational Linguistics (COLING-1992), Nantes, pp. 588–594 (1992)

  41. Windhouwer, M., Wright, S.E.: Linking to linguistic data categories in ISOcat. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics, pp. 99–107. Springer, Heidelberg (2012)

Acknowledgements

The work of Christian Chiarcos was supported by a postdoc fellowship of the German Academic Exchange Service (DAAD). The work of John McCrae and Philipp Cimiano was developed in the context of the Monnet project, which is funded by the European Union FP7 program under grant number 248458 and the CITEC excellence initiative funded by the DFG (Deutsche Forschungsgemeinschaft). Christiane Fellbaum’s work is supported by a grant from the U.S. National Science Foundation (CNS 0855157). We would also like to thank Nancy Ide and two anonymous reviewers for valuable comments and feedback.

Author information

Corresponding author

Correspondence to Christian Chiarcos.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Chiarcos, C., McCrae, J., Cimiano, P., Fellbaum, C. (2013). Towards Open Data for Linguistics: Linguistic Linked Data. In: Oltramari, A., Vossen, P., Qin, L., Hovy, E. (eds) New Trends of Research in Ontologies and Lexical Resources. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31782-8_2

  • DOI: https://doi.org/10.1007/978-3-642-31782-8_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31781-1

  • Online ISBN: 978-3-642-31782-8

  • eBook Packages: Computer Science, Computer Science (R0)
