A Big Data Case Study in Digital Humanities

Heyer, Gerhard; Tiepmar, Jochen

doi:10.1007/s13222-018-00302-7

Creating a Performance Benchmark for Canonical Text Services

Focus article (Schwerpunktbeitrag)
Published: 10 December 2018

Datenbank-Spektrum, volume 19, pages 41–49 (2019)

Gerhard Heyer¹ & Jochen Tiepmar¹,²

Abstract

While the volume of primary data in the text-oriented humanities is small compared with the terabytes that are now standard in Big Data applications, the secondary data that result from scholarly annotation require a fine-grained reference model for primary data that is based on hierarchical structure. This paper proposes a reusable performance benchmark for Canonical Text Services (CTS), a service for accessing and retrieving the text content and structural meta information of hierarchically structured texts, and shows how the benchmark can be used to evaluate the technical performance of such a system.

Notes

The result from MySQL’s AUTO_INCREMENT is not necessarily gap-free.
SELECT urn WHERE urn LIKE BINARY 'urn:cts:pbc:bible.parallel.eng.kingjames:2.1.%'.
The dot (.) and the colon (:).
SELECT URN WHERE urn LIKE '[URN]%' AND urn LIKE BINARY '[URN]%'.
The text column has been indexed due to the implementation of the full-text search described in [14], but this additional index is not required for the CTS index.
Java was chosen because of its widespread support and because it is straightforward to deploy as a web application (Servlets).
The CTS URN urn:cts:pbc:bible.parallel.eng.kingjames:2.1.2 is 46 characters long.
A B-tree that is processed in such a way that its input and output are the same as those of a trie.
Including the speed of the network itself, possible internal proxy redirects, additional server traffic, and even the performance that a specific browser software provides.
See https://developer.ted.com/.
The incorrect document-level meta information in the TED subtitle transcripts cannot be repaired because the API is closed. This is not a problem for a performance benchmark because its purpose is not to validate the content.
pbc: 657,936; dta: 16,438,119; ted: 15,292,408.
dta2, dta3, ted2, dta1, ted3, dta4, ted1, ted4, pbc.
For example, Sentence 1 in Chap. 2 to Sentence 3 in Chap. 4.
Java’s default StringTokenizer is used, with space, tab, newline, carriage return, and form feed as hardcoded delimiters. If these delimiters do not apply to a specific language, the sub-passage is the non-tokenized text, which is also a valid request.
AuthenticAMD Common KVM Processor.
Debian 8.5 3.167-ckt25-2 (2016-04-08) x86_64, codename Jessie.
AMD Opteron 6234.
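
Two of the notes above lend themselves to a short illustration: the case-sensitive prefix query on CTS URNs and the tokenization with Java's default StringTokenizer. The following is a minimal sketch and not the authors' implementation. The class name CtsNotesSketch, the methods urnsWithPrefix and tokens, the in-memory TreeSet that stands in for the database index, and all URNs other than the King James example are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.StringTokenizer;
import java.util.TreeSet;

// Hypothetical illustration class; none of these names come from the paper.
public class CtsNotesSketch {

    /**
     * Case-sensitive prefix lookup over a sorted set of URNs. This is
     * comparable in effect to "urn LIKE BINARY 'prefix%'", assuming no URN
     * contains the character U+FFFF directly after the prefix.
     */
    static List<String> urnsWithPrefix(NavigableSet<String> sortedUrns, String prefix) {
        // Every string that starts with the prefix sorts at or after the
        // prefix and before prefix + U+FFFF (String ordering is by UTF-16
        // code unit, so the comparison is case-sensitive).
        String upperBound = prefix + Character.MAX_VALUE;
        return new ArrayList<>(sortedUrns.subSet(prefix, true, upperBound, false));
    }

    /** Splits a passage with StringTokenizer's default delimiters " \t\n\r\f". */
    static List<String> tokens(String passage) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(passage);
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample URNs; only the ...:2.1.2 example appears in the notes.
        String base = "urn:cts:pbc:bible.parallel.eng.kingjames:";
        NavigableSet<String> urns = new TreeSet<>();
        urns.add(base + "2.1.1");
        urns.add(base + "2.1.2");
        urns.add(base + "2.10.1");
        urns.add(base + "2.2.1");

        // The trailing dot in the prefix keeps 2.10.1 out of the result.
        System.out.println(urnsWithPrefix(urns, base + "2.1."));

        // Length of the example URN from the notes.
        System.out.println((base + "2.1.2").length());

        // Made-up sample text, split on space, tab, and newline.
        System.out.println(tokens("one two\tthree\nfour"));
    }
}
```

The sorted-set range lookup is chosen here because it mirrors how an ordered index can answer a prefix query without scanning every URN. The prefix ends with the dot delimiter so that sibling passages such as 2.10.1 are not matched by accident.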

References

1. Blackwell C, Roughan C, Smith DN (2017) Citation and alignment: scholarship outside and inside the codex. Manuscript Studies, vol 1
2. Brass P (2008) Advanced data structures. Cambridge University Press, Cambridge
3. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn. MIT Press, Cambridge, Massachusetts
4. Fielding RT (2000) Architectural styles and the design of network-based software architectures. University of California, Irvine
5. Geyken A, Haaf S, Jurish B, Schulz M, Steinmann J, Thomas C, Wiegand F (2011) Das Deutsche Textarchiv: Vom historischen Korpus zum aktiven Archiv. In: Digitale Wissenschaft: Stand und Entwicklung digital vernetzter Forschung in Deutschland, vol 2
6. Henrich A, Heyer G, Schlieder C, Haerder T (2015) Editorial. Datenbank Spektrum 15(1):1–6
7. Mayer T, Cysouw M (2014) Creating a massively parallel Bible corpus. In: Proceedings of LREC
8. McCarty W (2005) Humanities computing. Palgrave, Basingstoke
9. Nah FF-H (2004) A study on tolerable waiting time: how long are Web users willing to wait? Behav Inf Technol 23(3):153–163
10. Schneider R (2012) Evaluating DBMS-based access strategies to very large multi-layer corpora. In: Proceedings of the LREC-2012 Workshop on Challenges in the Management of Large Corpora, Istanbul
11. Smith DN (2009) Citation in classical studies. Digit Humanit Q 3(1). http://www.digitalhumanities.org/dhq/vol/3/1/000028/000028.html. Accessed: 04.12.2018
12. Text-Encoding-Initiative (2007) TEI guidelines for electronic text encoding and interchange P5. http://www.tei-c.org/Guidelines. Accessed: 04.12.2018
13. Tiepmar J (2018) Implementation and evaluation of the canonical text service protocol as part of a research infrastructure in the digital humanities. PhD thesis, Leipzig University, Leipzig
14. Tiepmar J, Heyer G (2017) An overview of canonical text services. Linguistics and Literature Studies
15. Tiepmar J, Teichmann C, Heyer G, Berti M, Crane G (2013) A new implementation for canonical text services. In: Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
16. Williamson DF, Parker RA, Kendrick JS (1989) The box plot: a simple visual method to interpret data. Ann Intern Med 110:916–921

Acknowledgements

Part of this work was funded by the German Federal Ministry of Education and Research within the project ScaDS Dresden/Leipzig (BMBF 01IS14014B).

Author information

Authors and Affiliations

NLP Group, Leipzig University, Leipzig, Germany
Gerhard Heyer & Jochen Tiepmar
ScaDS, Leipzig University, Leipzig, Germany
Jochen Tiepmar

Corresponding author

Correspondence to Jochen Tiepmar.

About this article

Cite this article

Heyer, G., Tiepmar, J. A Big Data Case Study in Digital Humanities. Datenbank Spektrum 19, 41–49 (2019). https://doi.org/10.1007/s13222-018-00302-7

Received: 01 October 2018
Accepted: 23 November 2018
Published: 10 December 2018
Issue Date: 06 March 2019
DOI: https://doi.org/10.1007/s13222-018-00302-7
