Large Synthetic Data from the ar$\mathrm {\chi }$iv for OCR Post Correction of Historic Scientific Articles

Naiman, J. P.; Cosillo, Morgan G.; Williams, Peter K. G.; Goodman, Alyssa

doi:10.1007/978-3-031-43849-3_23

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14241)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Historical scientific articles often require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We present a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the ar$\mathrm {\chi }$iv we create, to the authors’ knowledge, the largest scientific synthetic ground truth/OCR post correction dataset, containing 203,354,393 character pairs. Baseline models trained with this dataset achieve mean improvements in character and word error rates of 7.71% and 18.82%, respectively, for historical OCR text. Interactive dashboards to explore the dataset are available online at https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023 and the data and code are hosted on GitHub at https://github.com/ReadingTimeMachine/ocr_post_correction.
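
The character and word error rates quoted in the abstract are conventionally computed from the Levenshtein (edit) distance between the ground truth and the OCR text. The following is a minimal sketch of that calculation, not the authors' evaluation code: the function names are hypothetical, words are split on whitespace, and "improvement" is assumed here to mean the relative reduction of an error rate after correction, which the abstract does not define.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ground_truth: str, ocr_text: str) -> float:
    """Character error rate: character edits per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth: str, ocr_text: str) -> float:
    """Word error rate: word edits per ground-truth word (whitespace split)."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


def relative_improvement(before: float, after: float) -> float:
    """Percent reduction of an error rate (assumed definition of 'improvement')."""
    if before == 0:
        raise ValueError("error rate before correction must be non-zero")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Toy strings invented for illustration; they are not taken from the dataset.
    truth = "the stellar mass function"
    ocr = "tbe stel1ar mass functlon"
    corrected = "the stellar mass functlon"
    print(cer(truth, ocr), cer(truth, corrected))
    print(wer(truth, ocr), wer(truth, corrected))
    print(relative_improvement(cer(truth, ocr), cer(truth, corrected)))
```

Averaging `relative_improvement` over many ground truth/OCR pairs would give a dataset-level figure of the kind reported above, under the stated assumption about how improvement is defined.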

Notes

1. https://ui.adsabs.harvard.edu/.
2. For example, following the process in Sect. 3.3, TeXSoup finds errors in only
3. https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023.
4. https://github.com/ReadingTimeMachine/ocr_post_correction.

Acknowledgments

This work is supported by a NASA Astrophysics Data Analysis Program Grant (20-ADAP20-0225).

Author information

Authors and Affiliations

School of Information Sciences, University of Illinois, Urbana-Champaign, 61820, USA
J. P. Naiman & Morgan G. Cosillo
Harvard-Smithsonian Center for Astrophysics, Cambridge, 02138, USA
Peter K. G. Williams & Alyssa Goodman

Corresponding author

Correspondence to J. P. Naiman.

Editor information

Editors and Affiliations

Amazon, Santa Clara, CA, USA
Omar Alonso
DataCite, Hannover, Germany
Helena Cousijn
University of Padua, Padua, Italy
Gianmaria Silvello
Europeana Foundation, BE Den Haag, The Netherlands
Mónica Marrero
University of Porto, Porto, Portugal
Carla Teixeira Lopes
University of Padua, Padua, Italy
Stefano Marchesin

About this paper

Cite this paper

Naiman, J.P., Cosillo, M.G., Williams, P.K.G., Goodman, A. (2023). Large Synthetic Data from the ar$\mathrm {\chi }$iv for OCR Post Correction of Historic Scientific Articles. In: Alonso, O., Cousijn, H., Silvello, G., Marrero, M., Teixeira Lopes, C., Marchesin, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2023. Lecture Notes in Computer Science, vol 14241. Springer, Cham. https://doi.org/10.1007/978-3-031-43849-3_23

DOI: https://doi.org/10.1007/978-3-031-43849-3_23
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43848-6
Online ISBN: 978-3-031-43849-3
eBook Packages: Computer Science, Computer Science (R0)
