DOI: 10.1145/3342558.3345426
short-paper

Semi-Automatic LaTeX-Based Labeling of Mathematical Objects in PDF Documents: MOP Data Set

Published: 23 September 2019

Abstract

Mathematical objects (MOs) in PDF documents are paramount to understanding the ontology and mathematical essence of published science, technology, engineering, and mathematics (STEM) documents. As of now, Marmot is the only publicly available data set for optimizing and evaluating MO labeling models in PDF documents. This paper therefore proposes a semi-automatic MO labeling algorithm that uses PDF documents and their corresponding LaTeX source files to generate a new data set consisting of MO bounding boxes (Bboxes) in PDF documents together with their LaTeX equation, topic, and subject. The first step in labeling each MO is to transform the LaTeX and PDF documents into a string format. Afterwards, a shortest unique string-matching technique is used to align PDF pages with LaTeX files. On each page, a similar shortest string-matching technique is employed to align each LaTeX MO with its PDF counterpart. Once an MO is located, the PDF and LaTeX MOs are normalized in order to match symbols between their LaTeX and PDF representations. A set of filtering rules eliminates matches that are considered exceedingly inconsistent. Matches that pass these rules have their MOs highlighted for final manual inspection. A total of 1,802 pages in the high energy physics (hep-th) field were labeled.
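
The abstract does not spell out the matching procedure, so the following is only a minimal sketch of how a shortest-unique-string match could align a LaTeX fragment with the PDF page that contains it. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names `normalize`, `shortest_unique_anchor`, and `align_to_page`, the alphanumeric-only normalization, the prefix-growing search, and the `min_len` parameter.

```python
import re
from typing import List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits (an assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def count_occurrences(haystack: str, needle: str) -> int:
    """Count occurrences of needle in haystack, including overlapping ones."""
    count, start = 0, haystack.find(needle)
    while start != -1:
        count += 1
        start = haystack.find(needle, start + 1)
    return count


def shortest_unique_anchor(
    query: str, target: str, min_len: int = 4
) -> Optional[Tuple[str, int]]:
    """Grow a prefix of query until it occurs exactly once in target.

    Returns (prefix, offset in target), or None if no prefix is unique.
    """
    for length in range(min_len, len(query) + 1):
        prefix = query[:length]
        hits = count_occurrences(target, prefix)
        if hits == 0:
            return None  # a longer prefix cannot match either
        if hits == 1:
            return prefix, target.find(prefix)
    return None


def align_to_page(latex_text: str, pdf_pages: List[str]) -> Optional[int]:
    """Return the index of the page holding the unique anchor, if any."""
    pages = [normalize(page) for page in pdf_pages]
    # "|" never survives normalize(), so an anchor cannot span two pages.
    joined = "|".join(pages)
    found = shortest_unique_anchor(normalize(latex_text), joined)
    if found is None:
        return None
    _, offset = found
    start = 0
    for index, page in enumerate(pages):
        if start <= offset < start + len(page):
            return index
        start += len(page) + 1  # skip the page text and its separator
    return None


if __name__ == "__main__":
    pages = [
        "We study the action S of the field.",
        "The action S is invariant under gauge transformations.",
        "Conclusions follow.",
    ]
    print(align_to_page("The action $S$ is invariant under", pages))
```

In this sketch the prefix keeps growing until it is long enough to distinguish one page from the others, which is the sense in which the anchor is "shortest" and "unique". A real pipeline would also need to handle fragments whose prefix is ambiguous but whose later text is not, which this version does not attempt.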

    Published In

    DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019
    September 2019
    254 pages
    ISBN:9781450368872
    DOI:10.1145/3342558

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. LaTeX
    2. Mathematical object
    3. PDF
    4. ground truth
    5. ontology
    6. semi-automatic labeling

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    DocEng '19
    DocEng '19: ACM Symposium on Document Engineering 2019
    September 23 - 26, 2019
    Berlin, Germany

    Acceptance Rates

    DocEng '19 Paper Acceptance Rate 30 of 77 submissions, 39%;
    Overall Acceptance Rate 194 of 564 submissions, 34%
