A Math Formula Extraction and Evaluation Framework for PDF Documents

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Abstract

We present a processing pipeline for math formula extraction in PDF documents that takes advantage of character information in born-digital PDFs (e.g., created using LaTeX or Word). Our pipeline is designed for indexing math in technical document collections to support math-aware search engines capable of processing queries containing keywords and formulas. The system includes user-friendly tools for visualizing recognition results in HTML pages. Our pipeline comprises a new state-of-the-art PDF character extractor that identifies precise bounding boxes for non-Latin symbols, a novel Single Shot Detector-based formula detector, and an existing graph-based formula parser (QD-GGA) for recognizing formula structure. To simplify the analysis of structure recognition errors, we have extended the LgEval library (from the CROHME competitions) to allow viewing all instances of a specific error by clicking on HTML links. Our source code is publicly available.
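As a rough illustration of how the stages described above could fit together, the sketch below groups extracted character boxes into detected formula regions before they would be handed to a structure parser. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Char`, `Formula`, `center_inside`, and `group_chars_by_region`, the `(x_min, y_min, x_max, y_max)` box layout, and the centre-point containment rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Char:
    """One character, as a character extractor might report it (assumed shape)."""
    symbol: str
    box: Box


@dataclass
class Formula:
    """A detected formula region plus the characters assigned to it."""
    box: Box
    chars: List[Char]


def center_inside(outer: Box, inner: Box) -> bool:
    """Return True if the centre point of `inner` lies within `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def group_chars_by_region(chars: List[Char], regions: List[Box]) -> List[Formula]:
    """Assign each character to every region containing its centre (an assumed rule)."""
    return [
        Formula(region, [c for c in chars if center_inside(region, c.box)])
        for region in regions
    ]


# Toy input: one character inside the detected region, one outside it.
chars = [Char("x", (10, 10, 14, 16)), Char("T", (100, 10, 108, 20))]
regions = [(5, 5, 30, 20)]
formulas = group_chars_by_region(chars, regions)
```

Each resulting `Formula` carries the symbols and boxes that a structure parser would take as input. Other assignment rules, such as an overlap-ratio threshold, would work equally well in this sketch.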

Notes

  1. https://www.cs.rit.edu/~dprl/software.html

  2. You can see the various ways we obtain the em square value by looking at the getBoundingBox() method in our BoundingBox class.

  3. https://zenodo.org/record/3483048#.XaCwmOdKjVo

Acknowledgements

A sincere thanks to all the students who contributed to the pipeline’s development: R. Joshi, P. Mali, P. Kukkadapu, A. Keller, M. Mahdavi and J. Diehl. Jian Wu provided the document collection used to evaluate SymbolScraper. This material is based upon work supported by the Alfred P. Sloan Foundation under Grant No. G-2017-9827 and the National Science Foundation (USA) under Grant Nos. IIS-1717997 (MathSeer project) and 2019897 (MMLI project).

Corresponding author

Correspondence to Richard Zanibbi.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this paper

Shah, A.K., Dey, A., Zanibbi, R. (2021). A Math Formula Extraction and Evaluation Framework for PDF Documents. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_2

  • DOI: https://doi.org/10.1007/978-3-030-86331-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86330-2

  • Online ISBN: 978-3-030-86331-9

  • eBook Packages: Computer Science, Computer Science (R0)
