Abstract
Although the volume of primary data in the text-oriented humanities is small compared with the terabytes that are now standard in Big Data applications, the secondary data that result from scholarly annotation require a fine-grained reference model for primary data that is based on hierarchical structure. This paper proposes a reusable performance benchmark for Canonical Text Services (CTS), a service for accessing and retrieving the text content and structural meta information of hierarchically structured texts, and shows how the benchmark can be used to evaluate the technical performance of such a system.
Notes
The result from MySQL’s AUTO_INCREMENT is not necessarily gap-free.
SELECT urn WHERE urn LIKE BINARY "urn:cts:pbc:bible.parallel.eng.kingjames:2.1.%".
The dot . and the colon :.
SELECT urn WHERE urn LIKE "[URN]%" AND urn LIKE BINARY "[URN]%".
The text column has been indexed due to the implementation of the full-text search described in [14], but this additional index is not required for the CTS index.
Java was chosen because of its widespread support and its straightforward use in web applications (Servlets).
The CTS URN urn:cts:pbc:bible.parallel.eng.kingjames:2.1.2 is 46 characters long.
A B-Tree that is processed in such a way that input and output are equal to that of a trie.
Including the speed of the network itself, possible internal proxy redirects, additional server traffic, and even the performance that a specific browser software provides.
See https://developer.ted.com/.
The incorrect document-level meta information in the TED subtitle transcripts cannot be repaired because the API is closed. This is not a problem for a performance benchmark because its purpose is not to validate the content.
pbc: 657,936; dta: 16,438,119; ted: 15,292,408.
dta2, dta3, ted2, dta1, ted3, dta4, ted1, ted4, pbc.
For example, Sentence 1 in Chap. 2 to Sentence 3 in Chap. 4.
Java’s default StringTokenizer is used, with space, tab, newline, carriage return, and form feed as hard-coded delimiters. If these delimiters do not apply to a specific language, the sub-passage is the non-tokenized text, which is also a valid request.
AuthenticAMD Common KVM Processor.
Debian 8.5 3.167-ckt25-2 (2016-04-08) x86_64, codename Jessie.
AMD Opteron 6234.
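The notes on the prefix query (LIKE BINARY with a trailing %) and on the B-tree that behaves like a trie can be illustrated with a minimal Java sketch. It is not the implementation evaluated in the paper: the class name `CtsPrefixSketch`, the helper `childrenOf`, and the sample index entries are assumptions made for illustration, and a `TreeSet` stands in for the database index. Because Java strings sort case-sensitively by character code, all URNs sharing a prefix form one contiguous run in the sorted set, which mirrors a case-sensitive prefix match on an ordered index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustrative sketch only; all names here are hypothetical.
class CtsPrefixSketch {

    // Returns every URN in the sorted index that starts with the given prefix.
    static List<String> childrenOf(NavigableSet<String> index, String prefix) {
        List<String> hits = new ArrayList<>();
        // tailSet positions the scan at the first entry >= prefix.
        for (String urn : index.tailSet(prefix, true)) {
            if (!urn.startsWith(prefix)) {
                break; // left the contiguous run of matching entries
            }
            hits.add(urn);
        }
        return hits;
    }

    public static void main(String[] args) {
        String edition = "urn:cts:pbc:bible.parallel.eng.kingjames:";
        NavigableSet<String> index = new TreeSet<>();
        // Made-up passage references used as sample data.
        index.add(edition + "2.1.1");
        index.add(edition + "2.1.2");
        index.add(edition + "2.10.1");
        index.add(edition + "2.2.1");

        // The trailing dot keeps 2.10.1 out of the result for 2.1.
        System.out.println(childrenOf(index, edition + "2.1."));
    }
}
```

The trailing delimiter in the prefix plays the same role as the final dot in the query of the note above: without it, a request for passage 2.1 would also match 2.10.1.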
References
Blackwell C, Roughan C, Smith DN (2017) Citation and alignment: scholarship outside and inside the codex. Manuscript Studies, vol 1
Brass P (2008) Advanced data structures. Cambridge University Press, Cambridge
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn. MIT Press, Cambridge, Massachusetts
Fielding RT (2000) Architectural styles and the design of network-based software architectures. University of California, Irvine (PhD thesis)
Geyken A, Haaf S, Jurish B, Schulz M, Steinmann J, Thomas C, Wiegand F (2011) Das Deutsche Textarchiv: Vom historischen Korpus zum aktiven Archiv. In: Digitale Wissenschaft: Stand und Entwicklung digital vernetzter Forschung in Deutschland, vol 2
Henrich A, Heyer G, Schlieder C, Haerder T (2015) Editorial. Datenbank Spektrum 15(1):1–6
Mayer T, Cysouw M (2014) Creating a massively parallel Bible corpus. In: Proceedings of LREC
McCarty W (2005) Humanities computing. Palgrave, Basingstoke
Nah FF-H (2004) A study on tolerable waiting time: how long are Web users willing to wait? Behav Inf Technol 23(3):153–163
Schneider R (2012) Evaluating DBMS-based access strategies to very large multi-layer corpora. In: Proceedings of the LREC-2012 Workshop on Challenges in the Management of Large Corpora, Istanbul
Smith DN (2009) Citation in classical studies. Digit Humanit Q 3(1). http://www.digitalhumanities.org/dhq/vol/3/1/000028/000028.html. Accessed: 04.12.2018
Text-Encoding-Initiative (2007) TEI guidelines for electronic text encoding and interchange P5. http://www.tei-c.org/Guidelines. Accessed: 04.12.2018
Tiepmar J (2018) Implementation and evaluation of the canonical text service protocol as part of a research infrastructure in the digital humanities. Leipzig University, Leipzig (PhD thesis)
Tiepmar J, Heyer G (2017) An overview of canonical text services. Linguistics and Literature Studies
Tiepmar J, Teichmann C, Heyer G, Berti M, Crane G (2013) A new implementation for canonical text services. In: Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
Williamson DF, Parker RA, Kendrick JS (1989) The box plot: a simple visual method to interpret data. Ann Intern Med 110:916–921
Acknowledgements
Part of this work was funded by the German Federal Ministry of Education and Research within the project ScaDS Dresden/Leipzig (BMBF 01IS14014B).
Cite this article
Heyer, G., Tiepmar, J. A Big Data Case Study in Digital Humanities. Datenbank Spektrum 19, 41–49 (2019). https://doi.org/10.1007/s13222-018-00302-7
DOI: https://doi.org/10.1007/s13222-018-00302-7