Abstract
We describe ongoing community efforts to create a Linked Open Data (sub-)cloud of linguistic resources, with an emphasis on resources that are specific to linguistic research, namely annotated corpora and linguistic databases. We argue that, for both types of resources, applying the Linked Open Data paradigm and representing the data in RDF is a promising approach to addressing interoperability problems and to integrating information from different repositories. This is illustrated with example studies for different kinds of linguistic resources. The efforts described in this chapter are conducted in the context of the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation. The OWLG is a network of researchers interested in linguistic resources and/or their publication under open licenses, and a number of its members are engaged in applying the Linked Open Data paradigm to their resources. Under the umbrella of the OWLG, these efforts will eventually result in the creation of a Linguistic Linked Open Data cloud (LLOD).
Notes
- 1.
The term ‘resource’ is ambiguous here. As understood in this chapter, resources are structured collections of data which can be represented, for example, using RDF. In RDF, however, ‘resource’ is the conventional name of a node in the graph, because, historically, these nodes were meant to represent objects that are described by metadata. Hence, we use the terms ‘node’ or ‘concept’ whenever RDF resources are meant.
- 2.
Federation is possible with SPARQL, although it does not necessarily perform well with state-of-the-art implementations. A more efficient alternative is thus to retrieve the content needed for a particular application from another endpoint and to query it locally. SPARQL endpoints provide this functionality, and publishing data under open licenses (see below) guarantees that the necessary legal preconditions for this practice are met.
- 3.
http://www.w3.org/DesignIssues/LinkedData.html, paragraph ‘Is your Linked Open Data 5 Star?’
- 4.
The application of RDF to linguistic resources as described here has occasionally been suggested (see [11, 13] for linguistic corpora), but these approaches focused on the RDF representation of individual resources rather than on linking them with other types of linguistic resources. By contrast, the focus of this chapter is not on modeling linguistic resources but on the potential of linking them with each other.
- 5.
- 6.
- 7.
- 8.
Note that [68] have also defined a TF-ICF measure (where “C” stands for “corpus”) with the objective of generating vectors for streaming documents in linear time. In DBpedia Spotlight’s TF*ICF (where “C” stands for “candidate”), the objective is to give more weight to words that are rare among confusable entities.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
Compiled by Eugene Chang, http://lingweb.eva.mpg.de/numeral/
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
For example, complex copyright situations may arise if one resource (say, a lexicon) was developed on the basis of another resource (say, a newspaper archive), and researchers are uncertain whether the examples from the original newspaper contained in the lexicon violate the original copyright. Ethical problems may arise if a database of quotations from a newspaper is linked to a database of speakers, and this database is further connected with, say, obituaries from the same newspaper. Even if this was done only in order to study generation-specific language variation, one may wonder whether such an accumulation of information violates the privacy of the people involved.
- 34.
- 35.
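Note 2 contrasts federated querying with retrieving the needed content from a remote endpoint once and querying it locally. The following is a minimal Python sketch of that second approach, not the authors' implementation. The endpoint URL, the predicate IRI, and the `ex:` triples are hypothetical placeholders. The network fetch itself is left out: the code only builds the request URL in the style of the SPARQL Protocol (a GET request with a `query` parameter) and then filters a made-up local copy of the data.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "http://example.org/sparql"


def build_construct_url(endpoint, predicate_iri):
    """Build a GET URL asking a SPARQL endpoint for all triples
    that use the given predicate (a CONSTRUCT query)."""
    query = (
        "CONSTRUCT { ?s <%s> ?o } WHERE { ?s <%s> ?o }"
        % (predicate_iri, predicate_iri)
    )
    return endpoint + "?" + urlencode({"query": query})


def match(triples, s=None, p=None, o=None):
    """Query a local copy of the data; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Stand-in for the triples a real fetch would return (made-up data).
local_copy = [
    ("ex:lexeme1", "ex:language", "ex:German"),
    ("ex:lexeme2", "ex:language", "ex:English"),
    ("ex:lexeme3", "ex:language", "ex:German"),
]

# Step 1: the URL that would be requested once from the remote endpoint.
url = build_construct_url(ENDPOINT, "http://example.org/language")

# Step 2: all further queries run against the local copy only.
german = match(local_copy, p="ex:language", o="ex:German")
```

In practice the local copy would be loaded into a triple store or an RDF library rather than a Python list; the list is used here only to keep the sketch self-contained.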
References
Abney S, Bird S (2010) The Human Language Project: Building a universal corpus of the world’s languages. In: 48th Annual Meeting of the Association for Computational Linguistics (ACL-2010), Uppsala, Sweden, pp 88–97
Baker C, Fellbaum C (2009) WordNet and FrameNet as complementary resources for annotation. In: 3rd Linguistic Annotation Workshop (LAW-2009), Suntec, Singapore, pp 125–129
Bakker D, Dahl O, Haspelmath M, Koptjevskaja-Tamm M, Lehmann C, Siewierska A (1993) EUROTYP guidelines. Technical report, European Science Foundation Programme in Language Typology
Baker C, Fillmore C, Lowe J (1998) The Berkeley FrameNet project. In: 36th Annual Meeting of the Association for Computational Linguistics (ACL-1998), Montréal, Canada, pp 86–90
Bender E (2008) Evaluating a crosslinguistic grammar resource: A case study of Wambaya. In: 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-2008: HLT), Columbus, Ohio, pp 977–985
Berners-Lee T (2006) Design issues: Linked data. http://www.w3.org/DesignIssues/LinkedData.html. Accessed 31 July 2012
Bies A, Ferguson M, Katz K, MacIntyre R (1995) Bracketing guidelines for Treebank II style Penn Treebank project. ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/root.ps.gz. Accessed 31 July 2012, version of January 1995
Bird S, Liberman M (2001) A formal framework for linguistic annotation. Speech Commun 33(1-2):23–60
Brandes U, Eiglsperger M, Herman I, Himsolt M, Marshall M (2002) GraphML progress report: Structural layer proposal. In: 9th International Symposium on Graph Drawing (GD-2001), Vienna, Austria, pp 501–512
Brown C, Holman E, Wichmann S, Velupillai V (2008) Automated classification of the world’s languages. STUF Lang Typol Univers 61(4):286–308
Burchardt A, Padó S, Spohr D, Frank A, Heid U (2008) Formalising multi-layer corpora in OWL/DL – Lexicon modelling, querying and consistency control. In: 3rd International Joint Conference on NLP (IJCNLP-2008), Hyderabad, India
Carletta J, Evert S, Heid U et al (2003) The NITE XML toolkit: Flexible annotation for multi-modal language data. Behav Res Methods Instrum Comput 35(3):353–363
Cassidy S (2010) An RDF realisation of LAF in the DADA annotation server. In: 5th Joint ISO-ACL/SIGSEM Workshop on Interoperable Semantic Annotation (ISA-5), Hong Kong, China
Chiarcos C (2012) A generic formalism to represent linguistic corpora in RDF and OWL/DL. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 3205–3212
Chiarcos C (2012) Ontologies of linguistic annotation: Survey and perspectives. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 303–310
Chiarcos C (2012) POWLA: Modeling linguistic corpora in OWL/DL. In: 9th Extended Semantic Web Conference (ESWC-2012), Heraklion, Crete, pp 225–239
Chiarcos C, Dipper S, Götze M, Leser U, Lüdeling A, Ritz J, Stede M (2008) A flexible framework for integrating annotations from different tools and tagsets. TAL (Traitement automatique des langues) 49(2):217–246
Chiarcos C, Hellmann S, Nordhoff S, Moran S, Littauer R, Eckle-Kohler J, Gurevych I, Hartmann S, Matuschek M, Meyer C (2012) The Open Linguistics Working Group. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp 3603–3610
Chiarcos C, Nordhoff S, Hellmann S (eds) (2012) Linked data in linguistics. Representing and connecting language data and language metadata. Springer, Heidelberg
Chiarcos C, McCrae J, Cimiano P, Fellbaum C (2012) Towards open data for linguistics: Linguistic linked data. In: Oltramari A, Lu-Qin, Vossen P, Hovy E (eds) New Trends of Research in Ontologies and Lexical Resources. Springer, Heidelberg
Corbett G (2005) Number of genders. In: Haspelmath M, Dryer M, Gil D, Comrie B (eds) The World Atlas of Language Structures. Oxford University Press, Oxford
Declerck T (2006) SynAF: Towards a standard for syntactic annotation. In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 229–232
Dimitriadis A, Everaert M, Reinhart T, Reuland E (2005) Anaphora typology database. http://languagelink.let.uu.nl/anatyp. Accessed 31 July 2012
Dostert L (1955) The Georgetown-IBM experiment. In: Locke W, Booth A (eds) Machine translation of languages. Wiley, New York, pp 124–135
Dryer M (1997) On the six-way word order typology. Stud Lang 21(1):69–103
Eckart R (2008) Choosing an XML database for linguistically annotated corpora. Sprache und Datenverarbeitung 32(1):7–22
Farrar S, Langendoen DT (2003) A linguistic ontology for the Semantic Web. GLOT Int 7:97–100
Farrar S, Langendoen DT (2010) An OWL-DL implementation of GOLD: An ontology for the Semantic Web. In: Witt A, Metzing D (eds) Linguistic modeling of information and markup languages, Springer, Dordrecht
Francis WN, Kucera H (1964) Brown Corpus manual, revised edition. Technical report, Brown University, Providence, Rhode Island, 1979
Francopoulo G, George M, Calzolari N, Monachini M, Bel N, Pet M, Soria C (2006) Lexical Markup Framework (LMF). In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 233–236
Gangemi A, Navigli R, Velardi P (2003) The OntoWordNet project: Extension and axiomatization of conceptual relations in WordNet. In: Meersman R, Tari Z (eds) Proceedings of On the Move to Meaningful Internet Systems (OTM-2003), Catania, Italy, pp 820–838
Good J, Hendryx-Parker C (2006) Modeling contested categorization in linguistic databases. In: EMELD Workshop on Digital Language Documentation, East Lansing, MI
Greenberg J (1960) A quantitative approach to the morphological typology of languages. Int J Am Linguist 26:178–194
Haspelmath M, Tadmor U (eds) (2009) World Loanword Database. Max Planck Digital Library, Munich
Haspelmath M, Dryer M, Gil D, Comrie B (eds) (2008) The World Atlas of Language Structures online. Max Planck Digital Library, Munich
Hellmann S, Unbehauen J, Chiarcos C, Ngonga Ngomo AC (2010) The TIGER Corpus Navigator. In: 9th International Workshop on Treebanks and Linguistic Theories (TLT-9), Tartu, Estonia, pp 91–102
Hellmann S, Lehmann J, Auer S (2012) Linked-data aware URI schemes for referencing text fragments. In: 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW-2012), Galway, Ireland
Hellmann S, Stadler C, Lehmann J (2012) The German DBpedia: A sense repository for linking entities. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics, Springer, Heidelberg, pp 181–190
Hwa R, Resnik P, Weinberg A, Cabezas C, Kolak O (2005) Bootstrapping parsers via syntactic projection across parallel texts. Nat Lang Eng 11(3):311–325
Ide N, Pustejovsky J (2010) What does interoperability mean, anyway? Toward an operational definition of interoperability. In: 2nd International Conference on Global Interoperability for Language Resources (ICGL-2010), Hong Kong, China
Ide N, Romary L (2004) International standard for a linguistic annotation framework. Nat Lang Eng 10(3-4):211–225
Ide N, Romary L (2004) A registry of standard data categories for linguistic annotation. In: 4th International Conference on Language Resources and Evaluation (LREC-2004), Lisbon, Portugal, pp 135–139
Ide N, Romary L (2006) Representing linguistic corpora and their annotations. In: 5th International Conference on Language Resources and Evaluation (LREC-2006), Genoa, Italy, pp 225–228
Ide N, Suderman K (2007) GrAF: A graph-based format for linguistic annotations. In: 1st Linguistic Annotation Workshop (LAW-2007), Prague, Czech Republic, pp 1–8
Ide N, Baker CF, Fellbaum C, Fillmore CJ, Passonneau R (2008) MASC: The Manually Annotated Sub-Corpus of American English. In: 6th International Conference on Language Resources and Evaluation (LREC-2008), Marrakech, Morocco, pp 2455–2461
Kemps-Snijders M, Windhouwer M, Wittenburg P, Wright S (2008) ISOcat: Corralling data categories in the wild. In: 6th International Conference on Language Resources and Evaluation (LREC-2008), Marrakech, Morocco, pp 887–891
Kilgarriff A, Grefenstette G (2003) Introduction to the special issue on the Web as Corpus. Comput Linguist 29(3):333–347
Leech G, Wilson A (1996) EAGLES recommendations for the morphosyntactic annotation of corpora. http://www.ilc.cnr.it/EAGLES/annotate/annotate.html. Accessed 31 July 2012
Lehmann J, Bizer C, Kobilarov G et al (2009) DBpedia – A crystallization point for the Web of Data. J Web Semant 7(3):154–165
Lewis W (2010) Haitian Creole: How to build and ship an MT engine from scratch in 4 days, 17 hours, & 30 minutes. In: 14th Annual Conference of the European Association for Machine Translation (EAMT-2010), Saint-Raphaël, France
Lux M, Laußmann J, Mehler A, Menßen C (2011) An online platform for visualizing lexical networks. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2011), Lyons, France, pp 495–496
Maddieson I (1984) Patterns of Sounds. Cambridge University Press, Cambridge/New York
McClanahan P, Busby G, Haertel R, Heal K, Lonsdale D, Seppi K, Ringger E (2010) A probabilistic morphological analyzer for Syriac. In: 14th Conference on Empirical Methods on Natural Language Processing (EMNLP-2010), Cambridge, MA, pp 810–820
McCrae J, Spohr D, Cimiano P (2011) Linking lexical resources and ontologies on the semantic web with Lemon. In: 8th Extended Semantic Web Conference (ESWC-2011), Heraklion, Crete. Springer, pp 245–259
Mendes P, Jakob M, García-Silva A, Bizer C (2011) DBpedia Spotlight: Shedding light on the Web of Documents. In: 7th International Conference on Semantic Systems (I-Semantics 2011), Graz, Austria
Mendes P, Daiber J, Rajapakse R et al (2012) Evaluating the impact of phrase recognition on concept tagging. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey
Mendes P, Jakob M, Bizer C (2012) DBpedia for NLP: A multilingual cross-domain knowledge base. In: 8th International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey
Meyers A, Ide N, Denoyer L, Shinyama Y (2007) The shared corpora working group report. In: 1st Linguistic Annotation Workshop (LAW-2007), Prague, Czech Republic, pp 184–190
Michaelis S, Maurer P, Haspelmath M, Huber M (eds) (to appear 2013) Atlas of Pidgin and Creole Language Structures. Oxford University Press, Oxford
Moran S (2012) Phonetics information base and lexicon. PhD thesis, University of Washington
Moran S (2012) Using linked data to create a typological knowledge base. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics. Springer, Heidelberg, pp 129–138
Moran S, McCloy D, Wright R (2012) Revisiting the population vs phoneme inventory correlation. In: 86th Annual Meeting of the Linguistic Society of America (LSA-2012), Portland, OR
Morris W (ed) (1969) The American Heritage dictionary of the English language. Houghton Mifflin, New York
Nordhoff S, Hammarström H (2011) Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In: 1st International Workshop on Linked Science (LISC-2011), Bonn, Germany
Pedersen T (2008) Empiricism is not a matter of faith. Comput Linguist 34(3):465–470
Ponzetto S, Navigli R (2009) Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In: 21st International Joint Conference on Artificial Intelligence (IJCAI-2009), Pasadena, CA, pp 2083–2088
Prud’Hommeaux E, Seaborne A (2008) SPARQL query language for RDF. W3C Recommendation. http://www.w3.org/TR/rdf-sparql-query. Accessed 20 December 2013
Reed JW, Jiao Y, Potok TE, Klump BA, Elmore MT, Hurson AR (2006) TF-ICF: A new term weighting scheme for clustering dynamic data streams. In: 5th International Conference on Machine Learning and Applications (ICMLA-2006), Washington, DC, pp 258–263
Romary L, Zeldes A, Zipser F (2011) [Tiger2/] – serialising the ISO SynAF syntactic object model. arXiv preprint arXiv:1108.0631
Schneider R (2007) A database-driven ontology for German grammar. In: Rehm G, Witt A, Lemnitzer L (eds) Data structures for linguistic resources and applications, Narr, Tübingen, pp 305–314
Schuurman I, Windhouwer M (2011) Explicit semantics for enriched documents. What do ISOcat, RELcat and SCHEMAcat have to offer?. In: 2nd Supporting Digital Humanities Conference, Copenhagen, Denmark
Su L, Sung L, Huang S, Hsieh F, Lin Z (2008) NTU corpus of Formosan languages: A state-of-the-art report. Corpus Linguist Linguist Theory 4(2):291–294
Telljohann H, Hinrichs E, Kübler S, Zinsmeister H, Beck K (2003) Stylebook for the Tübingen treebank of written German (TüBa-D/Z). Technical report, Seminar für Sprachwissenschaft, Universität Tübingen, Germany
Tramp S, Frischmuth P, Arndt N, Ermilov T, Auer S (2011) Weaving a distributed, semantic social network for mobile users. In: 8th Extended Semantic Web Conference (ESWC-2011), Heraklion, Crete, pp 200–214
Tyers F, Wiechetek L, Trosterud T (2009) Developing prototypes for machine translation between two Sámi languages. In: 13th Annual Conference of the European Association for Machine Translation (EAMT-2009), Barcelona, Spain, pp 120–127
Vatant B, Wick M (2012) GeoNames ontology. http://www.geonames.org/ontology. Accessed 31 July 2012, version 3.01
Weibel S, Kunze J, Lagoze C, Wolf M (1998) RFC 2413 – Dublin core metadata for resource discovery. http://www.ietf.org/rfc/rfc2413.txt. Accessed 31 July 2012, Network Working Group
Windhouwer M, Wright S (2012) Linking to linguistic data categories in ISOcat. In: Chiarcos C, Nordhoff S, Hellmann S (eds) Linked data in linguistics, Springer, Heidelberg, pp 99–107
Acknowledgements
In parts, this chapter is based on a number of earlier conference presentations, including [14, 18] and [38]. We would like to thank the contributors to these papers: Jonas Brekle, Philipp Cimiano, Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Sebastian Hellmann, Jens Lehmann, Michael Matuschek, John McCrae, Christian M. Meyer, and Claus Stadler. We would also like to thank all other OWLG members, as well as the participants of LDL-2012. Further, we would like to express our gratitude towards the anonymous reviewers for feedback and comments. The research of the first author was partially supported by a DAAD postdoctoral fellowship at the Information Sciences Institute of the University of Southern California.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Chiarcos, C., Moran, S., Mendes, P.N., Nordhoff, S., Littauer, R. (2013). Building a Linked Open Data Cloud of Linguistic Resources: Motivations and Developments. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_12
DOI: https://doi.org/10.1007/978-3-642-35085-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35084-9
Online ISBN: 978-3-642-35085-6
eBook Packages: Computer Science, Computer Science (R0)