DOI: 10.1145/3539618.3591803
short-paper

Searching the ACL Anthology with Math Formulas and Text

Published: 18 July 2023

Abstract

Mathematical notation is a key analytical resource for science and technology. Unfortunately, current math-aware search engines require LaTeX or template palettes to construct formulas, which can be challenging for non-experts. Also, their indexed collections are primarily web pages where formulas are represented explicitly in machine-readable formats (e.g., LaTeX, Presentation MathML). The new MathDeck system searches PDF documents in a portion of the ACL Anthology using both formulas and text, and shows matched words and formulas along with other extracted formulas in-context. In PDF, formulas are not demarcated: a new indexing module extracts formulas using PDF vector graphics information and computer vision techniques. For non-expert users and visual editing, a central design feature of MathDeck's interface is formula 'chips' usable in formula creation, search, reuse, and annotation with titles and descriptions in cards. For experts, LaTeX is supported in the text query box and the visual formula editor. MathDeck is open-source, and our demo is available online.
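
To make the idea of a combined formula-and-text query concrete, here is a minimal sketch of how a single query string might be separated into text keywords and LaTeX formulas before each part is sent to its own index. The `$...$` delimiter convention and the names `split_query` and `MixedQuery` are assumptions made for illustration only; they are not MathDeck's actual query syntax or API, which the abstract does not describe.

```python
import re
from dataclasses import dataclass, field

# Assumption for illustration: formulas are written between single dollar signs.
FORMULA_PATTERN = re.compile(r"\$(.+?)\$")


@dataclass
class MixedQuery:
    """Hypothetical container for the two parts of a math-aware query."""
    keywords: list[str] = field(default_factory=list)
    formulas: list[str] = field(default_factory=list)


def split_query(raw: str) -> MixedQuery:
    """Separate $...$ LaTeX spans from the surrounding text keywords."""
    # Collect the LaTeX source of every delimited formula, in order.
    formulas = [match.strip() for match in FORMULA_PATTERN.findall(raw)]
    # Blank out the formulas so only plain-text terms remain.
    text_only = FORMULA_PATTERN.sub(" ", raw)
    return MixedQuery(keywords=text_only.split(), formulas=formulas)


if __name__ == "__main__":
    query = split_query(r"attention $\frac{QK^T}{\sqrt{d_k}}$ softmax")
    print(query.keywords)   # plain-text terms
    print(query.formulas)   # LaTeX strings, one per formula
```

In a sketch like this, the keyword list would go to a conventional text index and each formula string to a formula index, with the two result lists merged afterwards; how MathDeck actually combines them is not stated here.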

Cited By

  • (2024) ChemScraper: leveraging PDF graphics instructions for molecular diagram parsing. International Journal on Document Analysis and Recognition, 27:3 (395-414). Online publication date: 5-Jul-2024. DOI: 10.1007/s10032-024-00486-7
  • (2023) Line-of-Sight with Graph Attention Parser (LGAP) for Math Formulas. Document Analysis and Recognition - ICDAR 2023 (401-419). Online publication date: 21-Aug-2023. DOI: 10.1007/978-3-031-41734-4_25

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. latex
  2. math-aware search
  3. mathematical information retrieval (mir)
  4. multimodal retrieval
  5. pdf

Qualifiers

  • Short-paper

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 37
  • Downloads (Last 6 weeks): 8
Reflects downloads up to 05 Mar 2025
