Identification of embedded mathematical formulas in PDF documents using SVM
23 January 2012
Proceedings Volume 8297, Document Recognition and Retrieval XIX; 82970D (2012) https://doi.org/10.1117/12.912445
Event: IS&T/SPIE Electronic Imaging, 2012, Burlingame, California, United States
Abstract
With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new and important problem in the document analysis field. In this paper, we present a method for identifying embedded mathematical formulas in PDF documents, based on a Support Vector Machine (SVM). The method first segments text lines into words and then classifies each word into one of two classes: formula or ordinary text. Various features of embedded formulas, including geometric layout, character, and context content, are used to build a robust and adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas. Experimental results show good performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale e-Book production.
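The pipeline described in the abstract (segment a line into words, classify each word, merge adjacent formula words) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the three features are crude stand-ins for the character, layout, and context features the abstract mentions, and `toy_decision` is a hand-set linear decision function standing in for a trained SVM. All function names are hypothetical.

```python
from typing import Callable, List

# Characters treated as "math-like" in this illustration only (an assumption).
MATH_CHARS = set("=+-*/^_<>()[]{}|")


def math_ratio(word: str) -> float:
    """Character feature: share of digits and math-like symbols in the word."""
    return sum(c in MATH_CHARS or c.isdigit() for c in word) / len(word)


def features(words: List[str], i: int) -> List[float]:
    """Feature vector for word i: character, crude layout, and context features."""
    left = math_ratio(words[i - 1]) if i > 0 else 0.0
    right = math_ratio(words[i + 1]) if i + 1 < len(words) else 0.0
    is_short = 1.0 if len(words[i]) <= 2 else 0.0  # stand-in for a layout feature
    return [math_ratio(words[i]), is_short, max(left, right)]


def toy_decision(x: List[float]) -> float:
    """Hand-set linear decision function; a placeholder for a trained SVM."""
    weights, bias = [2.0, 0.5, 1.0], -1.0
    return sum(w * v for w, v in zip(weights, x)) + bias


def identify_formulas(
    line: str, decide: Callable[[List[float]], float] = toy_decision
) -> List[str]:
    """Split a line into words, label each word, and merge adjacent formula words."""
    words = line.split()
    labels = [decide(features(words, i)) > 0 for i in range(len(words))]

    formulas: List[str] = []
    run: List[str] = []
    for word, is_formula in zip(words, labels):
        if is_formula:
            run.append(word)
        elif run:
            formulas.append(" ".join(run))
            run = []
    if run:
        formulas.append(" ".join(run))
    return formulas


print(identify_formulas("the energy is E = m c^2 for any mass"))
# prints ['E = m c^2']
```

In a realistic setting the `decide` argument would be replaced by the decision function of a classifier trained on labeled words, and the words would come from PDF glyph positions rather than whitespace splitting. The context feature is what lets the isolated letters `E` and `m` be labeled as formula words here, since neither contains a math symbol on its own.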
© (2012) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Xiaoyan Lin, Liangcai Gao, Zhi Tang, Xuan Hu, and Xiaofan Lin "Identification of embedded mathematical formulas in PDF documents using SVM", Proc. SPIE 8297, Document Recognition and Retrieval XIX, 82970D (23 January 2012); https://doi.org/10.1117/12.912445
CITATIONS
Cited by 14 scholarly publications.
KEYWORDS
Mathematics, Feature extraction, Image segmentation, Binary data, Data modeling, Associative arrays, Machine learning