Abstract
An important initial step of mathematical formula recognition is to correctly identify the location of formulae within documents. Previous work in this area has traditionally focused on image-based documents; however, given the prevalence and popularity of the PDF format for dissemination, alternatives to image-based approaches are increasingly being explored. In this paper, we investigate the use of both machine learning techniques and heuristic rules to locate the boundaries of both isolated and embedded formulae within documents, based upon data extracted directly from PDF files. We propose four new features along with preprocessing and post-processing techniques for isolated formula identification. Furthermore, we compare, analyse and extensively tune nine state-of-the-art learning algorithms for a comprehensive evaluation of our proposed methods. The evaluation is carried out over a ground-truth dataset, which we have made publicly available, together with an application-adaptable, fine-grained evaluation metric. Our experimental results demonstrate that the overall accuracies of isolated and embedded formula identification increase by 11.52 % and 10.65 %, respectively, compared with our previously proposed formula identification approach.





References
Anderson, R.H.: Syntax-directed recognition of hand-printed two-dimensional mathematics. PhD thesis, Harvard University, Cambridge, Massachusetts (1968)
Lin, X., Gao, L., Tang, Z., Lin, X., Hu, X.: Mathematical formula identification in PDF documents. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1419–1423. IEEE (2011)
Lin, X., Gao, L., Tang, Z., Hu, X., Lin, X.: Identification of embedded mathematical formulas in PDF documents using SVM. In: Document Recognition and Retrieval (DRR) XIX, pp. 82970D-1–8 (2012)
Lin, X., Gao, L., Tang, Z., Lin, X., Hu, X.: Performance evaluation of mathematical formula identification. In: The 10th IAPR International Workshop on Document Analysis Systems (DAS), pp. 287–291. IEEE (2012)
Adobe. PDF reference, 7th edition (2008)
Zanibbi, R., Blostein, D.: Recognition and retrieval of mathematical expressions. Int. J. Document Anal. Recognit. pp. 1–27 (2011)
Baker, J.B.: A linear grammar approach for the analysis of mathematical documents. PhD thesis, University of Birmingham (2012)
Rahman, F., Alam, H.: Conversion of PDF documents into HTML: a case study of document image analysis. In: Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 87–91. IEEE (2003)
Baker, J.B., Sexton, A.P., Sorge, V.: A linear grammar approach to mathematical formula recognition from PDF. In: Proceedings of the 8th International Conference on Mathematical Knowledge Management, vol. 5625 of LNAI, pp. 201–216. Springer (2009)
Fateman, R.J., Tokuyasu, T., Berman, B.P., Mitchell, N.: Optical character recognition and parsing of typeset mathematics. J. Vis. Commun. Image Represent. 7(1), 2–15 (1996)
Lee, H.J., Wang, J.S.: Design of a mathematical expression understanding system. Pattern Recognit. Lett. 18(3), 289–298 (1997)
Toumit, J.Y., Garcia-Salicetti, S., Emptoz, H.: A hierarchical and recursive model of mathematical expressions for automatic reading of mathematical documents. In: Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR), pp. 119–122. IEEE (1999)
Garain, U., Chaudhuri, B.B.: A syntactic approach for processing mathematical expressions in printed documents. In: Proceedings of the 15th International Conference on Pattern Recognition (ICPR), vol. 4, pp. 523–526. IEEE (2000)
Kacem, A., Belaïd, A., Ben Ahmed, M.: Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context. Int. J. Document Anal. Recognit. 4(2), 97–108 (2001)
Inoue, K., Miyazaki, R., Suzuki, M.: Optical recognition of printed mathematical documents. In: Proceedings of the Third Asian Technology Conference on Mathematics, pp. 280–289 (1998)
Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: INFTY: an integrated OCR system for mathematical documents. In: Proceedings of the 2003 ACM Symposium on Document Engineering, pp. 95–104. ACM (2003)
Baker, J.B., Sexton, A.P., Sorge, V.: Towards reverse engineering of PDF documents. In: Towards a Digital Mathematics Library, pp. 65–75. Masaryk University Press (2011)
Lee, H.J., Wang, J.S.: Design of a mathematical expression recognition system. In: Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 1084–1087. IEEE (1995)
Chowdhury, S.P., Mandal, S., Das, A.K., Chanda, B.: Automated segmentation of math-zones from document images. In: 7th International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 755–759 (2003)
Chang, T.Y., Takiguchi, Y., Okada, M.: Physical structure segmentation with projection profile for mathematic formulae and graphics in academic paper images. In: The Ninth International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 1193–1197. IEEE (2007)
Garain, U., Chaudhuri, B.B., Chaudhuri, A.R.: Identification of embedded mathematical expressions in scanned documents. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR), vol. 1, pp. 384–387. IEEE (2004)
Garain, U.: Identification of mathematical expressions in document images. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1340–1344. IEEE (2009)
Jin, J., Han, X., Wang, Q.: Mathematical formulas extraction. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1138–1141. IEEE (2003)
Drake, D.M., Baird, H.S.: Distinguishing mathematics notation from English text using computational geometry. In: Proceedings. Eighth International Conference on Document Analysis and Recognition (ICDAR), pp. 1270–1274. IEEE (2005)
Liu, Y., Bai, K., Gao, L.: An efficient pre-processing method to identify logical components from PDF documents. Adv. Knowl. Discov. Data Min. pp. 500–511 (2011)
Uchida, S., Nomura, A., Suzuki, M.: Quantitative analysis of mathematical documents. Int. J. Document Anal. Recognit. 7(4), 211–218 (2005)
Phillips, I., Chanda, B., Haralick, R.: University of Washington UW-III English technical document image database (1996)
http://ntcir-math.nii.ac.jp/ (2013)
Gao, L., Tang, Z., Lin, X., Qiu, R.: Comprehensive global typography extraction system for electronic book documents. In: The Eighth IAPR International Workshop on Document Analysis Systems (DAS), pp. 615–621. IEEE (2008)
Bishop, C.M.: Pattern Recognition and Machine Learning, vol. 4. Springer, New York (2006)
Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)
Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Kim, S.H., Jeong, C.B., Kwag, H.K., Suen, C.Y.: Word segmentation of printed text lines based on gap clustering and special symbol detection. In: Proceedings. 16th International Conference on Pattern Recognition (ICPR), vol. 2, pp. 320–323. IEEE (2002)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Acknowledgments
This work is sponsored by the National Natural Science Foundation of China (No. 61202232) and the National Key Technology R&D Program of China (No. 2012BAH40F01). We would like to thank our colleagues Jing Fang, Yongtao Wang and Luyuan Li for their comments on this paper. The learning algorithms in our paper are implemented using LIBSVM and WEKA, open-source software packages that provide implementations of machine learning algorithms.
Cite this article
Lin, X., Gao, L., Tang, Z. et al. Mathematical formula identification and performance evaluation in PDF documents. IJDAR 17, 239–255 (2014). https://doi.org/10.1007/s10032-013-0216-1
DOI: https://doi.org/10.1007/s10032-013-0216-1