cqp4rdf: Towards a Suite for RDF-Based Corpus Linguistics

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12124)

Abstract

In this paper, we present cqp4rdf, a set of tools for creating and querying corpora with linguistic annotations. cqp4rdf builds on CQP, an established corpus query language widely used in computational lexicography and empirical linguistics, and allows it to be applied to corpora represented in RDF.

This is in line with the emerging trend towards RDF-based corpus formats, which provide several benefits over more traditional formats, such as support for virtually unlimited types of annotation, linking of corpus elements across multiple datasets, and simultaneous querying of distributed language resources and corpora with different annotations.

On the other hand, application support tailored for such corpora is virtually nonexistent, leaving corpus linguists with SPARQL as the only query language. Although extremely powerful, SPARQL has a relatively steep learning curve, especially for people without a computer science background. At the same time, query languages designed for classic corpus management software cannot exploit the vast possibilities of RDF-based corpora.

We present a middle ground that aims to bridge this gap: an interface that allows users to query RDF corpora and explore the results in a linguist-friendly way.
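To make the contrast between the two query languages concrete, the following is a minimal sketch of how a very small subset of CQP (a plain sequence of `[attribute="value"]` tokens) might be rewritten as SPARQL. It is not the cqp4rdf implementation: the function name `cqp_to_sparql`, the `conll:` and `nif:` prefixes (chosen in the style of CoNLL-RDF), and the mapping of attribute names to upper-case properties are all assumptions made for illustration. Values are treated as plain strings, whereas real CQP treats them as regular expressions.

```python
import re

# Assumed vocabulary in the style of CoNLL-RDF; not taken from the paper.
PREFIXES = (
    "PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>\n"
    "PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>\n"
)

# Matches one token constraint of the form [attr="value"].
TOKEN_RE = re.compile(r'\[\s*(\w+)\s*=\s*"([^"\\]*)"\s*\]')


def cqp_to_sparql(query: str) -> str:
    """Translate a sequence of [attr="value"] tokens into a SPARQL SELECT.

    Hypothetical helper: only adjacent-token sequences with exact-match
    values are supported; anything else raises ValueError.
    """
    tokens = TOKEN_RE.findall(query)
    # Reject queries with no tokens or with leftover, unsupported syntax.
    if not tokens or TOKEN_RE.sub("", query).strip():
        raise ValueError(f"unsupported CQP query: {query!r}")

    patterns = []
    for i, (attr, value) in enumerate(tokens):
        # One triple pattern per token constraint.
        patterns.append(f'  ?w{i} conll:{attr.upper()} "{value}" .')
        if i > 0:
            # Adjacent CQP tokens become an explicit word-order link.
            patterns.append(f"  ?w{i - 1} nif:nextWord ?w{i} .")

    variables = " ".join(f"?w{i}" for i in range(len(tokens)))
    body = "\n".join(patterns)
    return f"{PREFIXES}SELECT {variables} WHERE {{\n{body}\n}}"


if __name__ == "__main__":
    # An adjective immediately followed by a noun.
    print(cqp_to_sparql('[upos="ADJ"] [upos="NOUN"]'))
```

Even in this toy case, the one-line CQP pattern expands into several triple patterns plus an explicit adjacency link, which hints at why writing SPARQL by hand is demanding for non-programmers.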

Notes

  1. Writing all the necessary queries for [4] took more than a week in total; the work was done by a developer in tandem with a linguist.

  2. https://purl.org/liodi/cqp4rdf.

  3. CQP is sometimes confused with CQL, which is a different query language; nevertheless, some systems use CQL as the name for CQP.

  4. http://purl.org/liodi/cqp4rdf/ud.

  5. Key word in context (KWIC).

  6. For brevity, we use the non-normative SPARQL 1.1 Property Paths syntax (W3C Working Draft, 26 January 2010), which some triple stores support as an extension.

  7. This is, of course, not a problem of SPARQL itself but a result of using an intermediate conversion, which hides the data model under the hood.

References

  1. Berners-Lee, T.: Linked data. Technical report, W3C Design Issue (2006)

  2. Burchardt, A., Padó, S., Spohr, D., Frank, A., Heid, U.: Formalising multi-layer corpora in OWL/DL - Lexicon modelling, querying and consistency control. In: Proceedings of the IJCNLP-2008 (2008)

  3. Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.): Linked Data in Linguistics. Representing Language Data and Metadata. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28249-2

  4. Chiarcos, C., Donandt, K., Sargsian, H., Ionov, M., Schreur, J.W.: Towards LLOD-based language contact studies. A case study in interoperability. In: Proceedings of the LREC 2018 (2018)

  5. Chiarcos, C., Fäth, C.: CoNLL-RDF: linked corpora done in an NLP-friendly way. In: Gracia, J., Bond, F., McCrae, J.P., Buitelaar, P., Chiarcos, C., Hellmann, S. (eds.) LDK 2017. LNCS (LNAI), vol. 10318, pp. 74–88. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59888-8_6

  6. Chiarcos, C., McCrae, J., Cimiano, P., Fellbaum, C.: Towards open data for linguistics: linguistic linked data. In: Oltramari, A., Vossen, P., Qin, L., Hovy, E. (eds.) New Trends of Research in Ontologies and Lexical Resources. NLP. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31782-8_2

  7. Christ, O.: The IMS corpus workbench technical manual. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart (1994)

  8. Cimiano, P., McCrae, J., Buitelaar, P.: Lexicon model for ontologies. Technical report, W3C Community Report (2016)

  9. Farrar, S., Langendoen, D.T.: A linguistic ontology for the semantic web. GLOT Int. 7(3), 97–100 (2003)

  10. Frank, A., Ivanovic, C.: Building literary corpora for computational literary analysis-a prototype to bridge the gap between CL and DH. In: Proceedings of the LREC 2018 (2018)

  11. Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Alani, H., et al. (eds.) ISWC 2013, Part II. LNCS, vol. 8219, pp. 98–113. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_7

  12. ISO: ISO 24612:2012. Language resource management - linguistic annotation framework. Technical report (2012)

  13. Kilgarriff, A., et al.: The sketch engine: ten years on. Lexicography 1, 7–36 (2014). https://doi.org/10.1007/s40607-014-0009-9

  14. Mazziotta, N.: Building the syntactic reference corpus of medieval French using NotaBene RDF annotation tool. In: Proceedings of the 4th Linguistic Annotation Workshop (LAW) (2010)

  15. Sanderson, R., Ciccarese, P., Young, B.: Web annotation data model. Technical report, W3C Recommendation 23 February 2017 (2017)

Author information

Corresponding author

Correspondence to Maxim Ionov .

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ionov, M., Stein, F., Sehgal, S., Chiarcos, C. (2020). cqp4rdf: Towards a Suite for RDF-Based Corpus Linguistics. In: Harth, A., et al. The Semantic Web: ESWC 2020 Satellite Events. ESWC 2020. Lecture Notes in Computer Science, vol 12124. Springer, Cham. https://doi.org/10.1007/978-3-030-62327-2_20

  • DOI: https://doi.org/10.1007/978-3-030-62327-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62326-5

  • Online ISBN: 978-3-030-62327-2

  • eBook Packages: Computer Science, Computer Science (R0)
