Interoperability of Corpora and Annotations

Chiarcos, Christian

doi:10.1007/978-3-642-28249-2_16

Abstract

This paper describes how OWL and RDF can be applied to the interoperability of linguistic corpora and of the annotations within such corpora. Interoperability of linguistic corpora involves two aspects: structural interoperability (annotations of different origin are represented using the same formalism) and conceptual interoperability (annotations of different origin are linked to a common vocabulary).

Building on an existing infrastructure for representing, storing, querying and visualizing multi-layer corpora with any kind of text-oriented annotation, this paper proposes to address both aspects by means of OWL/RDF-based formalisms. Key advantages of this approach include the rich technological ecosystem that has developed around RDF and OWL, the conceptual similarity between RDF and generic data models for linguistic annotations (both are based on labeled directed graphs), and the availability of OWL/DL reasoners, which can be used to validate the consistency of linguistic corpora and their annotations and to infer additional information that is relevant, for example, for their appropriate visualization.
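
The two aspects of interoperability and the consistency check can be illustrated with a minimal sketch. It is not taken from the chapter: the token identifiers, tag names, the `shared:` vocabulary and the helper functions are all hypothetical, and the disjointness test is only a toy stand-in for what an OWL/DL reasoner would do.

```python
from collections import defaultdict

# Structural interoperability: annotations from two tools share one formalism,
# here (subject, predicate, object) triples, i.e. edges of a labeled directed graph.
# All identifiers below are hypothetical.
tool_a = [("doc1#tok1", "a:pos", "a:NNS"), ("doc1#tok2", "a:pos", "a:VB")]
tool_b = [("doc1#tok1", "b:pos", "b:NOUN_PL"), ("doc1#tok2", "b:pos", "b:NOUN_SG")]

# Conceptual interoperability: tool-specific tags are linked to a common vocabulary.
tag_to_concept = {
    "a:NNS": {"shared:Noun", "shared:Plural"},
    "a:VB": {"shared:Verb"},
    "b:NOUN_PL": {"shared:Noun", "shared:Plural"},
    "b:NOUN_SG": {"shared:Noun", "shared:Singular"},
}

# Pairs of concepts assumed to be mutually exclusive (a toy disjointness axiom).
disjoint_pairs = [("shared:Noun", "shared:Verb")]


def concepts_by_token(triples, mapping):
    """Collect the shared-vocabulary concepts attached to each token."""
    result = defaultdict(set)
    for subject, _predicate, obj in triples:
        result[subject] |= mapping.get(obj, set())
    return dict(result)


def inconsistent_tokens(concepts, pairs):
    """Return tokens that carry two concepts declared disjoint."""
    return sorted(
        token
        for token, assigned in concepts.items()
        if any(a in assigned and b in assigned for a, b in pairs)
    )


merged = concepts_by_token(tool_a + tool_b, tag_to_concept)
conflicts = inconsistent_tokens(merged, disjoint_pairs)
print(conflicts)
```

In this sketch the two tools agree on the first token once their tags are mapped to the shared vocabulary, while the second token is flagged because one tool treats it as a noun and the other as a verb.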

Additionally, representing corpora in OWL and RDF makes it possible to interlink resources freely, e.g., different annotation layers of a multi-layer corpus, translated texts in parallel corpora, or linguistic corpora and lexical-semantic resources. Modeled in this way, corpora can be fully integrated into a Linked Open Data (sub-)cloud of linguistic resources, along with lexical-semantic resources and knowledge bases of information about languages and linguistic terminology.
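
As a rough illustration of such interlinking, the sketch below stores links between a parallel corpus, an annotation layer and a lexical resource as triples and follows them from one token. All resource names and link labels are hypothetical and chosen only for this example.

```python
# Links between resources as (subject, predicate, object) triples.
# Every identifier and predicate name here is hypothetical.
links = [
    ("corpus_de#tok7", "translationOf", "corpus_en#tok5"),
    ("corpus_en#tok5", "denotes", "lexicon#bank_institution"),
    ("corpus_en#tok5", "inLayer", "syntax_layer"),
]


def reachable(start, triples):
    """Return every node reachable from `start` by following directed links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for subject, _predicate, obj in triples:
            if subject == node and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


linked = reachable("corpus_de#tok7", links)
print(sorted(linked))
```

Starting from a token in the German text, the traversal reaches its English counterpart and, through it, the lexical entry and the annotation layer, which is the kind of cross-resource navigation that free interlinking is meant to enable.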

Author information

Authors and Affiliations

Information Sciences Institute, University of Southern California, 4676 Admiralty Way # 1001, Marina del Rey, CA, 90292, USA
Christian Chiarcos

Editor information

Editors and Affiliations

Information Sciences Institute, University of Southern California, Admiralty Way 4676, Marina del Rey, 90292, California, USA
Christian Chiarcos
Department of Linguistics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig, 04103, Germany
Sebastian Nordhoff
Business Information Systems, University of Leipzig, Johannisgasse 26, Leipzig, 04103, Germany
Sebastian Hellmann

About this chapter

Cite this chapter

Chiarcos, C. (2012). Interoperability of Corpora and Annotations. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds) Linked Data in Linguistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28249-2_16

DOI: https://doi.org/10.1007/978-3-642-28249-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28248-5
Online ISBN: 978-3-642-28249-2
eBook Packages: Computer Science, Computer Science (R0)