Interoperability of Corpora and Annotations

Abstract

This paper describes the application of OWL and RDF to the interoperability of linguistic corpora and of the linguistic annotations within them. Interoperability of linguistic corpora involves two aspects: structural interoperability (annotations of different origin are represented using the same formalism) and conceptual interoperability (annotations of different origin are linked to a common vocabulary).

Building on an existing infrastructure for representing, storing, querying and visualizing multi-layer corpora with any kind of text-oriented annotation, this paper proposes to address both aspects by means of OWL/RDF-based formalisms. Key advantages of this approach include the rich technological ecosystem that has developed around RDF and OWL, the conceptual similarity between generic data models for linguistic annotations and RDF (both are based on labeled directed graphs), and the availability of OWL/DL reasoners, which can be used to validate the consistency of linguistic corpora and their annotations and to infer additional information that is relevant, for example, for their appropriate visualization.
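To make the two aspects of interoperability more concrete, the following is a minimal sketch in plain Python rather than the chapter's actual OWL/RDF modeling. It treats annotations as (subject, predicate, object) triples, i.e. edges of a labeled directed graph, and maps tool-specific tags to a shared vocabulary. All prefixes, tag names and concept names (`corpus:`, `a:`, `b:`, `common:`) are hypothetical, and the final check is only a simplistic stand-in for what an OWL/DL reasoner would do.

```python
from typing import Dict, List, Set, Tuple

# Assumption: a triple is (subject, predicate, object), as in RDF.
Triple = Tuple[str, str, str]

# Two annotation layers of different origin, expressed in the same
# formalism (structural interoperability). Identifiers are hypothetical.
tagger_a: List[Triple] = [
    ("corpus:tok1", "a:pos", "a:NN"),
    ("corpus:tok2", "a:pos", "a:VVFIN"),
]
tagger_b: List[Triple] = [
    ("corpus:tok1", "b:cat", "b:Noun"),
    ("corpus:tok2", "b:cat", "b:Verb"),
]

# Tool-specific tags linked to a common vocabulary
# (conceptual interoperability). The mapping is illustrative only.
tag_to_concept: Dict[str, str] = {
    "a:NN": "common:Noun",
    "a:VVFIN": "common:Verb",
    "b:Noun": "common:Noun",
    "b:Verb": "common:Verb",
}


def concepts_by_token(*layers: List[Triple]) -> Dict[str, Set[str]]:
    """Collect, per token, the shared concepts asserted by any layer."""
    result: Dict[str, Set[str]] = {}
    for layer in layers:
        for subject, _predicate, obj in layer:
            concept = tag_to_concept.get(obj)
            if concept is not None:
                result.setdefault(subject, set()).add(concept)
    return result


def inconsistent_tokens(concepts: Dict[str, Set[str]]) -> List[str]:
    """Naive consistency check: flag tokens with more than one concept.

    This assumes the concepts are mutually exclusive; a real OWL/DL
    reasoner would derive this from disjointness axioms instead.
    """
    return sorted(tok for tok, cs in concepts.items() if len(cs) > 1)


merged = concepts_by_token(tagger_a, tagger_b)
print(merged)
print(inconsistent_tokens(merged))
```

In this toy setting both taggers agree once their tags are mapped to the shared vocabulary, so no token is flagged. Changing one mapping entry, for instance pointing `b:Verb` at a different concept, would make the corresponding token appear in the list of inconsistencies.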

Additionally, representing corpora in OWL and RDF allows resources to be interlinked freely, e.g., different annotation layers of a multi-layer corpus, translated texts in parallel corpora, or linguistic corpora and lexical-semantic resources. Modeled in this way, corpora can be fully integrated into a Linked Open Data (sub-)cloud of linguistic resources, along with lexical-semantic resources and knowledge bases of information about languages and linguistic terminology.
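The idea of freely interlinked resources can be illustrated with another small sketch, again in plain Python and under stated assumptions: three separately maintained triple sets (a corpus, a lexicon, and a parallel-text alignment) are combined, and links are followed across resource boundaries. The resource names, predicates (`ex:hasLexicalEntry`, `ex:denotes`, `ex:translationOf`) and identifiers are hypothetical and are not taken from the chapter.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical, separately maintained resources that share identifiers.
corpus: List[Triple] = [
    ("corpus:tok1", "ex:hasLexicalEntry", "lexicon:entry_house"),
]
lexicon: List[Triple] = [
    ("lexicon:entry_house", "ex:denotes", "kb:House"),
]
parallel: List[Triple] = [
    ("corpus:tok1", "ex:translationOf", "corpus_de:tok1"),
]


def reachable(start: str, *graphs: List[Triple]) -> Set[str]:
    """Return every node reachable from `start` by following edges
    in the union of the given graphs (simple graph traversal)."""
    edges = [triple for graph in graphs for triple in graph]
    seen: Set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for subject, _predicate, obj in edges:
            if subject == node and obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


linked = reachable("corpus:tok1", corpus, lexicon, parallel)
print(sorted(linked))
```

Because all three resources use the same triple representation, combining them amounts to taking the union of their edges; the knowledge-base concept is reached from the corpus token only via the lexicon, which is the kind of cross-resource link the paragraph above describes.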


Author information

Corresponding author

Correspondence to Christian Chiarcos.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Chiarcos, C. (2012). Interoperability of Corpora and Annotations. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds) Linked Data in Linguistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28249-2_16

  • DOI: https://doi.org/10.1007/978-3-642-28249-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28248-5

  • Online ISBN: 978-3-642-28249-2

  • eBook Packages: Computer Science, Computer Science (R0)
