Large Synthetic Data from the ar\(\mathrm {\chi }\)iv for OCR Post Correction of Historic Scientific Articles

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2023)

Abstract

Historical scientific articles often require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that frequently introduces errors. We present a pipeline for generating a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the ar\(\mathrm {\chi }\)iv we create, to the authors’ knowledge, the largest scientific synthetic ground truth/OCR post correction dataset, comprising 203,354,393 character pairs. Baseline models trained with this dataset achieve mean improvements in character and word error rates of 7.71% and 18.82%, respectively, on historical OCR text. Interactive dashboards to explore the dataset are available online at https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and data and code are hosted on GitHub at https://github.com/ReadingTimeMachine/ocr_post_correction.
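
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how character and word error rates are commonly computed from the Levenshtein edit distance. It is not the authors' code: the function names, the toy strings, and the definition of "improvement" as a relative reduction in error rate are all assumptions made for illustration, and the edit distance is written in plain Python rather than with any particular package.

```python
from typing import Sequence


def edit_distance(source: Sequence, target: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(
                previous[j] + 1,         # delete s_item
                current[j - 1] + 1,      # insert t_item
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Character-level edits divided by the ground-truth character count."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


def word_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Word-level edits divided by the ground-truth word count."""
    truth_words = ground_truth.split()
    if not truth_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ocr_text.split(), truth_words) / len(truth_words)


def relative_improvement(before: float, after: float) -> float:
    """Percent reduction in an error rate (assumed definition of improvement)."""
    return 100.0 * (before - after) / before


# Hypothetical example strings, not taken from the dataset.
truth = "the stellar mass function"
raw_ocr = "tbe stel1ar mass functlon"
corrected = "the stellar mass functlon"

cer_before = character_error_rate(raw_ocr, truth)
cer_after = character_error_rate(corrected, truth)
wer_before = word_error_rate(raw_ocr, truth)
wer_after = word_error_rate(corrected, truth)

print(f"CER {cer_before:.2f} -> {cer_after:.2f}")
print(f"WER {wer_before:.2f} -> {wer_after:.2f}")
print(f"CER improvement: {relative_improvement(cer_before, cer_after):.1f}%")
```

Word error rate reuses the same distance function on lists of whitespace-separated tokens, so a single wrong character counts as one full word error; this is one reason word-level figures tend to move more than character-level ones.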

Notes

  1. https://ui.adsabs.harvard.edu/.

  2. For example, following the process in Sect. 3.3, TeXSoup finds errors in only

  3. https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023.

  4. https://github.com/ReadingTimeMachine/ocr_post_correction.

Acknowledgments

This work is supported by a NASA Astrophysics Data Analysis Program Grant (20-ADAP20-0225).

Author information

Corresponding author

Correspondence to J. P. Naiman.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Naiman, J.P., Cosillo, M.G., Williams, P.K.G., Goodman, A. (2023). Large Synthetic Data from the ar\(\mathrm {\chi }\)iv for OCR Post Correction of Historic Scientific Articles. In: Alonso, O., Cousijn, H., Silvello, G., Marrero, M., Teixeira Lopes, C., Marchesin, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2023. Lecture Notes in Computer Science, vol 14241. Springer, Cham. https://doi.org/10.1007/978-3-031-43849-3_23

  • DOI: https://doi.org/10.1007/978-3-031-43849-3_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43848-6

  • Online ISBN: 978-3-031-43849-3

  • eBook Packages: Computer Science, Computer Science (R0)
