Identification of embedded mathematical formulas in PDF documents using SVM

Xiaoyan Lin; Liangcai Gao; Zhi Tang; Xuan Hu; Xiaofan Lin

doi:10.1117/12.912445

23 January 2012

Proceedings Volume 8297, Document Recognition and Retrieval XIX; 82970D (2012) https://doi.org/10.1117/12.912445
Event: IS&T/SPIE Electronic Imaging, 2012, Burlingame, California, United States

Abstract

With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new and important problem in the document analysis field. In this paper, we present a method for identifying embedded mathematical formulas in PDF documents, based on a Support Vector Machine (SVM). The method first segments text lines into words and then classifies each word into one of two classes: formula or ordinary text. Various features of embedded formulas, including geometric layout, character, and context content, are used to build a robust and adaptable SVM classifier. Embedded formulas are then extracted by merging adjacent words labeled as formulas. Experimental results show good performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale e-Book production.
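The three stages named in the abstract (segment a text line into words, classify each word, merge adjacent formula words) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the names `word_features`, `is_formula`, and `identify_formulas` are hypothetical, the three character-level features are invented, and the hand-set `WEIGHTS` and `BIAS` merely stand in for the decision function of a trained linear SVM. The geometric layout and context features the paper relies on are not modeled.

```python
from typing import Callable, List

# Hypothetical set of characters that hint at mathematical content.
MATH_CHARS = set("=+-*/^_<>|")

def word_features(word: str) -> List[float]:
    """Toy character-level features (assumed; not the paper's feature set)."""
    n = len(word)
    math_ratio = sum(ch in MATH_CHARS for ch in word) / n
    digit_ratio = sum(ch.isdigit() for ch in word) / n
    # A lone letter such as "x" often denotes a variable; skip common words.
    lone_letter = 1.0 if n == 1 and word.isalpha() and word not in ("a", "A", "I") else 0.0
    return [math_ratio, digit_ratio, lone_letter]

# Hand-set parameters standing in for a trained linear SVM (assumption).
WEIGHTS = [4.0, 2.0, 2.0]
BIAS = -1.0

def is_formula(word: str) -> bool:
    """Linear decision function: a positive score means 'formula'."""
    score = sum(w * x for w, x in zip(WEIGHTS, word_features(word))) + BIAS
    return score > 0

def identify_formulas(line: str, classify: Callable[[str], bool] = is_formula) -> List[str]:
    """Split a line into words, label each one, and merge adjacent formula words."""
    formulas: List[str] = []
    current: List[str] = []
    for word in line.split():
        if classify(word):
            current.append(word)
        elif current:
            # An ordinary-text word ends the current run of formula words.
            formulas.append(" ".join(current))
            current = []
    if current:
        formulas.append(" ".join(current))
    return formulas
```

For example, `identify_formulas("The energy E = mc^2 is conserved")` groups the three adjacent formula words into the single span `"E = mc^2"`. In a realistic setting the `classify` argument would wrap a classifier trained on labeled words; the toy weights above misclassify many ordinary tokens, such as years.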

Citation

Xiaoyan Lin, Liangcai Gao, Zhi Tang, Xuan Hu, and Xiaofan Lin "Identification of embedded mathematical formulas in PDF documents using SVM", Proc. SPIE 8297, Document Recognition and Retrieval XIX, 82970D (23 January 2012); https://doi.org/10.1117/12.912445
