Abstract
In this paper, we present cqp4rdf, a set of tools for creating and querying corpora with linguistic annotations. cqp4rdf builds on CQP, an established corpus query language widely used in computational lexicography and empirical linguistics, and makes it possible to apply CQP to corpora represented in RDF.
This is in line with the emerging trend towards RDF-based corpus formats, which provide several benefits over more traditional formats, such as support for virtually unlimited types of annotation, linking of corpus elements across multiple datasets, and simultaneous querying of distributed language resources and corpora with different annotations.
On the other hand, application support tailored to such corpora is virtually nonexistent, leaving corpus linguists with SPARQL as the query language. Although extremely powerful, SPARQL has a relatively steep learning curve, especially for people without a computer science background. At the same time, using query languages designed for classic corpus management software limits the vast possibilities of RDF-based corpora.
We present a middle ground that aims to bridge this gap: an interface that allows users to query RDF corpora and explore the results in a linguist-friendly way.
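To make the contrast between the two query languages concrete, the following minimal sketch rewrites a sequence of simple CQP token constraints as a SPARQL query. It is an illustration only and not the cqp4rdf implementation. The function name `cqp_to_sparql`, the variable naming scheme, and the choice of CoNLL-RDF-style properties (`conll:` columns in upper case, `nif:nextWord` for adjacency) are assumptions made for this example. Attribute values are treated as literal strings, whereas CQP proper interprets them as regular expressions.

```python
import re

# Assumed prefixes in the style of the CoNLL-RDF vocabulary; adjust to the corpus at hand.
PREFIXES = (
    "PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>\n"
    "PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>\n"
)

# One CQP token constraint of the form [attr="value"].
TOKEN_RE = re.compile(r'\[\s*(\w+)\s*=\s*"([^"]*)"\s*\]')
# A whole query: one or more such constraints separated by whitespace.
QUERY_RE = re.compile(r'\s*(?:\[\s*\w+\s*=\s*"[^"]*"\s*\]\s*)+')


def cqp_to_sparql(query: str) -> str:
    """Translate a query such as [pos="ADJ"] [pos="NOUN"] into SPARQL.

    Hypothetical helper: only single-attribute, literal-valued tokens are handled.
    """
    if not QUERY_RE.fullmatch(query):
        raise ValueError(f"unsupported CQP query: {query!r}")

    tokens = TOKEN_RE.findall(query)
    patterns = []
    for i, (attr, value) in enumerate(tokens):
        # Each token becomes a variable constrained by one annotation column.
        patterns.append(f'  ?w{i} conll:{attr.upper()} "{value}" .')
        if i > 0:
            # Consecutive CQP tokens are adjacent words in the corpus.
            patterns.append(f"  ?w{i - 1} nif:nextWord ?w{i} .")

    variables = " ".join(f"?w{i}" for i in range(len(tokens)))
    body = "\n".join(patterns)
    return f"{PREFIXES}SELECT {variables} WHERE {{\n{body}\n}}"


if __name__ == "__main__":
    print(cqp_to_sparql('[pos="ADJ"] [pos="NOUN"]'))
```

Even for this two-token pattern, the generated SPARQL is several times longer than the CQP original, which is the usability gap the abstract describes.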
Notes
- 1. Total time spent on writing all the necessary queries for [4] was more than a week; the work was done by a developer in tandem with a linguist.
- 2.
- 3. CQP is sometimes confused with CQL, which is another query language; nevertheless, some systems use CQL as the name for CQP.
- 4.
- 5. Key-word in context.
- 6. For brevity, we use non-normative SPARQL 1.1 Property Path (W3C Working Draft 26.01.2010), which is supported by some triple stores as an extension.
- 7. This is, of course, not a problem of SPARQL itself but a result of using an intermediate conversion, which hides the data model under the hood.
References
Berners-Lee, T.: Linked data. Technical report, W3C Design Issue (2006)
Burchardt, A., Padó, S., Spohr, D., Frank, A., Heid, U.: Formalising multi-layer corpora in OWL/DL - Lexicon modelling, querying and consistency control. In: Proceedings of the IJCNLP-2008 (2008)
Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.): Linked Data in Linguistics. Representing Language Data and Metadata. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28249-2
Chiarcos, C., Donandt, K., Sargsian, H., Ionov, M., Schreur, J.W.: Towards LLOD-based language contact studies. A case study in interoperability. In: Proceedings of the LREC 2018 (2018)
Chiarcos, C., Fäth, C.: CoNLL-RDF: linked corpora done in an NLP-friendly way. In: Gracia, J., Bond, F., McCrae, J.P., Buitelaar, P., Chiarcos, C., Hellmann, S. (eds.) LDK 2017. LNCS (LNAI), vol. 10318, pp. 74–88. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59888-8_6
Chiarcos, C., McCrae, J., Cimiano, P., Fellbaum, C.: Towards open data for linguistics: linguistic linked data. In: Oltramari, A., Vossen, P., Qin, L., Hovy, E. (eds.) New Trends of Research in Ontologies and Lexical Resources. NLP. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31782-8_2
Christ, O.: The IMS corpus workbench technical manual. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart (1994)
Cimiano, P., McCrae, J., Buitelaar, P.: Lexicon model for ontologies. Technical report, W3C Community Report (2016)
Farrar, S., Langendoen, D.T.: A linguistic ontology for the semantic web. GLOT Int. 7(3), 97–100 (2003)
Frank, A., Ivanovic, C.: Building literary corpora for computational literary analysis - a prototype to bridge the gap between CL and DH. In: Proceedings of the LREC 2018 (2018)
Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Alani, H., et al. (eds.) ISWC 2013, Part II. LNCS, vol. 8219, pp. 98–113. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_7
ISO: ISO 24612:2012. Language resource management - linguistic annotation framework. Technical report (2012)
Kilgarriff, A., et al.: The sketch engine: ten years on. Lexicography 1, 7–36 (2014). https://doi.org/10.1007/s40607-014-0009-9
Mazziotta, N.: Building the syntactic reference corpus of medieval French using NotaBene RDF annotation tool. In: Proceedings of the 4th Linguistic Annotation Workshop (LAW) (2010)
Sanderson, R., Ciccarese, P., Young, B.: Web annotation data model. Technical report, W3C Recommendation 23 February 2017 (2017)
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ionov, M., Stein, F., Sehgal, S., Chiarcos, C. (2020). cqp4rdf: Towards a Suite for RDF-Based Corpus Linguistics. In: Harth, A., et al. The Semantic Web: ESWC 2020 Satellite Events. ESWC 2020. Lecture Notes in Computer Science, vol 12124. Springer, Cham. https://doi.org/10.1007/978-3-030-62327-2_20
DOI: https://doi.org/10.1007/978-3-030-62327-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62326-5
Online ISBN: 978-3-030-62327-2
eBook Packages: Computer Science, Computer Science (R0)