
A Big Data Case Study in Digital Humanities

Creating a Performance Benchmark for Canonical Text Services

  • Focus article (Schwerpunktbeitrag)
  • Datenbank-Spektrum

Abstract

While the volume of primary data in the text-oriented humanities is small compared to the terabytes that are now standard in Big Data applications, the secondary data that result from scholarly annotation require a fine-grained reference model for primary data that is based on hierarchical structure. This paper proposes a reusable performance benchmark for Canonical Text Services (CTS), a service for accessing and retrieving the text content and structural meta information of hierarchically structured texts, and shows how the benchmark can be used to evaluate the technical performance of such a system.

Notes

  1. The result from MySQL’s AUTO_INCREMENT is not necessarily gap-free.

  2. SELECT urn WHERE urn LIKE BINARY "urn:cts:pbc:bible.parallel.eng.kingjames:2.1.%".

  3. The dot . and the colon :.

  4. SELECT urn WHERE urn LIKE "[URN]%" AND urn LIKE BINARY "[URN]%".

  5. The text column has been indexed due to the implementation of the full-text search described in [14], but this additional index is not required for the CTS index.

  6. Java was chosen because of its widespread support and its straightforward use in web applications (Servlets).

  7. The CTS URN urn:cts:pbc:bible.parallel.eng.kingjames:2.1.2 is 46 characters long.

  8. A B-Tree that is processed in such a way that input and output are equal to that of a trie.

  9. Including the speed of the network itself, possible internal proxy redirects, additional server traffic, and even the performance that a specific browser software provides.

  10. See https://developer.ted.com/.

  11. The incorrect document-level meta information in the TED subtitle transcripts cannot be repaired because the API is closed. This is not a problem for a performance benchmark because its purpose is not to validate the content.

  12. pbc:657,936, dta:16,438,119, ted:15,292,408.

  13. dta2, dta3, ted2, dta1, ted3, dta4, ted1, ted4, pbc.

  14. For example, Sentence 1 in Chap. 2 to Sentence 3 in Chap. 4.

  15. Java’s default StringTokenizer is used, with space, tab, newline, carriage return, and form feed as hardcoded delimiters. If these delimiters do not apply to a specific language, the sub-passage is the non-tokenized text, which is also a correct request.

  16. AuthenticAMD Common KVM Processor.

  17. Debian 8.5 (3.16.7-ckt25-2, 2016-04-08) x86_64, codename Jessie.

  18. AMD Opteron 6234.
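
Notes 2 to 4, 7, and 15 can be illustrated with a minimal Java sketch. It is not the authors' implementation: the class name, the helper methods, the in-memory sorted set standing in for the indexed urn column, and all sample URNs other than the one quoted in Note 7 are assumptions made for illustration. The sketch shows why a prefix query presumably ends in a delimiter (without the trailing dot, the prefix 2.1 would also match 2.10.1 and 2.11), and how the default StringTokenizer delimiters split a passage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeSet;

final class CtsNotesSketch {

    // Work-level part of the URN quoted in Note 7.
    static final String WORK = "urn:cts:pbc:bible.parallel.eng.kingjames:";

    // Assumption: a sorted set stands in for the indexed urn column.
    // Only the 2.1.2 URN appears in the notes; the others are hypothetical.
    static final TreeSet<String> URNS = new TreeSet<>();

    static {
        URNS.add(WORK + "2.1.1");
        URNS.add(WORK + "2.1.2");
        URNS.add(WORK + "2.1.10");
        URNS.add(WORK + "2.10.1");
        URNS.add(WORK + "2.11");
    }

    /** Case-sensitive prefix scan, comparable in spirit to urn LIKE BINARY "[prefix]%". */
    static List<String> urnsWithPrefix(String prefix) {
        // Every string that starts with prefix sorts at or after prefix
        // and before prefix followed by the largest char value.
        String upperBound = prefix + Character.MAX_VALUE;
        return new ArrayList<>(URNS.subSet(prefix, true, upperBound, false));
    }

    /** Splits a passage using StringTokenizer's default delimiters (space, tab, newline, CR, form feed). */
    static List<String> tokens(String passage) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(passage);
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // With the trailing dot, only passages below 2.1 are returned.
        System.out.println(urnsWithPrefix(WORK + "2.1."));
        // Without it, 2.10.1 and 2.11 are matched as well.
        System.out.println(urnsWithPrefix(WORK + "2.1"));
        // Whitespace-delimited tokens of a passage.
        System.out.println(tokens("In the beginning\tGod"));
        // No default delimiter present: the whole text is a single token.
        System.out.println(tokens("NoWhitespaceHere"));
        // Length of the URN from Note 7.
        System.out.println((WORK + "2.1.2").length());
    }
}
```

In this sketch the range scan returns results in lexicographic order, so 2.1.10 sorts before 2.1.2. A relational index would behave differently in detail, and the sorted set is used here only because it makes the prefix semantics easy to inspect.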

References

  1. Blackwell C, Roughan C, Smith DN (2017) Citation and alignment: scholarship outside and inside the codex. Manuscript Studies, vol. 1

  2. Brass P (2008) Advanced data structures. Cambridge University Press, Cambridge

  3. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to Algorithms, 2nd edn. MIT Press, Cambridge, Massachusetts

  4. Fielding RT (2000) Architectural styles and design of network-based software architectures. University of California, Irvine (doctoral dissertation)

  5. Geyken A, Haaf S, Jurish B, Schulz M, Steinmann J, Thomas C, Wiegand F (2011) Das Deutsche Textarchiv: Vom historischen Korpus zum aktiven Archiv. In: Digitale Wissenschaft: Stand und Entwicklung digital vernetzter Forschung in Deutschland, vol. 2

  6. Henrich A, Heyer G, Schlieder C, Haerder T (2015) Editorial. Datenbank Spektrum 15(1):1–6

  7. Mayer T, Cysouw M (2014) Creating a massively parallel Bible corpus. In: Proceedings of LREC

  8. McCarty W (2005) Humanities computing. Palgrave, Basingstoke

  9. Nah FF-H (2004) A study on tolerable waiting time: how long are Web users willing to wait? Behav Inf Technol 23(3):153–163

  10. Schneider R (2012) Evaluating DBMS-based access strategies to very large multi-layer corpora. Proceedings of the LREC-2012 Workshop on Challenges in the Management of Large Corpora. Istanbul

  11. Smith DN (2009) Citation in classical studies. Digit Humanit Q 3(1). http://www.digitalhumanities.org/dhq/vol/3/1/000028/000028.html. Accessed: 04.12.2018

  12. Text-Encoding-Initiative (2007) TEI guidelines for electronic text encoding and interchange P5. http://www.tei-c.org/Guidelines. Accessed: 04.12.2018

  13. Tiepmar J (2018) Implementation and evaluation of the canonical text service protocol as part of a research infrastructure in the digital humanities. Leipzig University, Leipzig (PhD thesis)

  14. Tiepmar J, Heyer G (2017) An overview of canonical text services. Linguistics and Literature Studies

  15. Tiepmar J, Teichmann C, Heyer G, Berti M, Crane G (2013) A new implementation for Canonical Text Services. In: Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

  16. Williamson DF, Parker RA, Kendrick JS (1989) The box plot: a simple visual method to interpret data. Ann Intern Med 110:916–921

Acknowledgements

Part of this work was funded by the German Federal Ministry of Education and Research within the project ScaDS Dresden/Leipzig (BMBF 01IS14014B).

Author information

Correspondence to Jochen Tiepmar.

Cite this article

Heyer, G., Tiepmar, J. A Big Data Case Study in Digital Humanities. Datenbank Spektrum 19, 41–49 (2019). https://doi.org/10.1007/s13222-018-00302-7
