
Mass Digitization of Early Modern Texts With Optical Character Recognition

Published: 07 December 2017

Abstract

Optical character recognition (OCR) engines work poorly on texts published with premodern printing technologies. Engaging the key technological contributors from the IMPACT project, an earlier project attempting to solve the OCR problem for early modern and modern texts, the Early Modern OCR Project (eMOP) of Texas A&M received funding from the Andrew W. Mellon Foundation to improve OCR outputs for early modern texts from the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) proprietary database products, some 45 million pages in all. Compounding the print problems is the poor quality of the page images in these collections, which would be too time-consuming and expensive to reimage. This article describes eMOP's attempts to OCR 307,000 documents digitized from microfilm to make our cultural heritage available for current and future researchers. We describe the reasoning behind our choices as we undertook the project based on other relevant studies; discoveries we made; the data and the system we developed for processing it; the software, algorithms, training procedures, and tools that we developed; and future directions that should be taken for further work in developing OCR engines for cultural heritage materials.


    Published In

    Journal on Computing and Cultural Heritage   Volume 11, Issue 1
    Special Issue on GCH 2016 and Regular Papers
    January 2018
    116 pages
    ISSN:1556-4673
    EISSN:1556-4711
    DOI:10.1145/3172938

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 December 2017
    Accepted: 01 March 2017
    Revised: 01 February 2017
    Received: 01 April 2016
    Published in JOCCH Volume 11, Issue 1


    Author Tags

    1. Machine learning
    2. digital humanities

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Andrew W. Mellon Foundation
