Abstract
Historical scientific articles often require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that frequently introduces errors. We present a pipeline for generating a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the ar\(\mathrm {\chi }\)iv we create, to the authors’ knowledge, the largest scientific synthetic ground truth/OCR post-correction dataset, with 203,354,393 character pairs. Baseline models trained with this dataset achieve mean improvements in character and word error rates of 7.71% and 18.82%, respectively, for historical OCR text. Interactive dashboards to explore the dataset are available online at https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and data and code are hosted on GitHub at https://github.com/ReadingTimeMachine/ocr_post_correction.
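To make the reported metrics concrete, the following is a minimal sketch of how character error rate (CER), word error rate (WER), and a relative improvement between raw and corrected OCR text could be computed. It assumes the common convention of Levenshtein edit distance normalized by the ground-truth length, and a relative-improvement definition that may differ from the one used for the figures above. All function names and the sample strings are hypothetical and are not taken from the released code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(truth, text):
    """Character error rate: character edits per ground-truth character."""
    return edit_distance(truth, text) / len(truth)


def wer(truth, text):
    """Word error rate: word edits per ground-truth word (whitespace split)."""
    truth_words = truth.split()
    return edit_distance(truth_words, text.split()) / len(truth_words)


def relative_improvement(before, after):
    """Percentage reduction in an error rate (assumed definition)."""
    return 100.0 * (before - after) / before


# Hypothetical example: ground truth, raw OCR output, and a corrected version.
truth = "the spectral index"
ocr = "tbe spectral indcx"
corrected = "the spectral indcx"

cer_gain = relative_improvement(cer(truth, ocr), cer(truth, corrected))
wer_gain = relative_improvement(wer(truth, ocr), wer(truth, corrected))
print(f"CER improvement: {cer_gain:.2f}%, WER improvement: {wer_gain:.2f}%")
```

In this toy case the correction fixes one of two character errors and one of two word errors, so both rates are halved. On real text the two metrics usually move by different amounts, because a single character fix can repair an entire word.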
Notes
- 2.
For example, following the process in Sect. 3.3, TeXSoup finds errors in only
Acknowledgments
This work is supported by a NASA Astrophysics Data Analysis Program Grant (20-ADAP20-0225).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Naiman, J.P., Cosillo, M.G., Williams, P.K.G., Goodman, A. (2023). Large Synthetic Data from the ar\(\mathrm {\chi }\)iv for OCR Post Correction of Historic Scientific Articles. In: Alonso, O., Cousijn, H., Silvello, G., Marrero, M., Teixeira Lopes, C., Marchesin, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2023. Lecture Notes in Computer Science, vol 14241. Springer, Cham. https://doi.org/10.1007/978-3-031-43849-3_23
DOI: https://doi.org/10.1007/978-3-031-43849-3_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43848-6
Online ISBN: 978-3-031-43849-3
eBook Packages: Computer Science (R0)