Abstract
This paper describes the application of OWL and RDF to the interoperability of linguistic corpora and of the linguistic annotations within such corpora. Interoperability of linguistic corpora involves two aspects: structural interoperability (annotations of different origin are represented using the same formalism) and conceptual interoperability (annotations of different origin are linked to a common vocabulary).
Building on an existing infrastructure for representing, storing, querying and visualizing multi-layer corpora with any kind of text-oriented annotation, this paper proposes to address both aspects by means of OWL/RDF-based formalisms. Key advantages of this approach include the rich technological ecosystem that has developed around RDF and OWL, the conceptual similarity between generic data models for linguistic annotations and RDF (both are based on labeled directed graphs), and the availability of OWL/DL reasoners, which can be applied to validate the consistency of linguistic corpora and their annotations and to infer additional information that is relevant, for example, for their appropriate visualization.
Additionally, representing corpora in OWL and RDF allows resources to be interlinked freely, e.g., different annotation layers of a multi-layer corpus, translated texts in parallel corpora, or linguistic corpora and lexical-semantic resources. Modeled in this way, corpora can be fully integrated into a Linked Open Data (sub-)cloud of linguistic resources, along with lexical-semantic resources and knowledge bases of information about languages and linguistic terminology.
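The two aspects of interoperability named above can be illustrated with a minimal sketch. It is not the chapter's implementation: it uses plain Python tuples in place of an RDF library, and every namespace, property name (`hasString`, `posToolA`, `posToolB`, `translationOf`) and concept (`CommonNoun`) is a hypothetical placeholder chosen for illustration.

```python
# Hypothetical namespaces and identifiers, for illustration only.
CORPUS = "http://example.org/corpus#"
VOCAB = "http://example.org/vocab#"

# Structural interoperability: annotations from two tools share one formalism,
# a set of (subject, predicate, object) triples forming a labeled directed graph.
triples = {
    (CORPUS + "tok1", CORPUS + "hasString", "Haus"),
    (CORPUS + "tok1", CORPUS + "posToolA", "NN"),
    (CORPUS + "tok2", CORPUS + "hasString", "house"),
    (CORPUS + "tok2", CORPUS + "posToolB", "NOUN"),
    # Free interlinking: a token in one text points to its translation in another.
    (CORPUS + "tok1", CORPUS + "translationOf", CORPUS + "tok2"),
}

# Conceptual interoperability: tool-specific tags are linked to a shared concept.
tag_to_concept = {
    ("posToolA", "NN"): VOCAB + "CommonNoun",
    ("posToolB", "NOUN"): VOCAB + "CommonNoun",
}


def tokens_with_concept(graph, mapping, concept):
    """Return the sorted subjects whose tool-specific tag maps to the shared concept."""
    hits = set()
    for subj, pred, obj in graph:
        local_name = pred.rsplit("#", 1)[-1]
        if mapping.get((local_name, obj)) == concept:
            hits.add(subj)
    return sorted(hits)


result = tokens_with_concept(triples, tag_to_concept, VOCAB + "CommonNoun")
```

Under these assumptions, a single query against the shared concept retrieves tokens annotated by either tool, regardless of which tagset produced the original label.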
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Chiarcos, C. (2012). Interoperability of Corpora and Annotations. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds) Linked Data in Linguistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28249-2_16
DOI: https://doi.org/10.1007/978-3-642-28249-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28248-5
Online ISBN: 978-3-642-28249-2
eBook Packages: Computer Science (R0)