DOI: 10.1145/3404835.3463264

Embedding Formulae and Text for Improved Math Retrieval

Published: 11 July 2021

ABSTRACT

Large data collections containing millions of math formulae are available online. Retrieving math expressions from these collections is challenging: the structural complexity of formulae requires specialized processing. When searching for mathematical content, accurate measures of formula similarity can help with tasks such as document ranking, query recommendation, and result set clustering. While there have been many attempts at embedding words and graphs, formula embedding is still in its early stages. This research aims to introduce an embedding model for mathematical formulae and accompanying text that can be used in math information retrieval. To that end, embedding models for isolated formulae are introduced first, using intrinsic measures to study the effectiveness and efficiency of retrieval with those embeddings. Those results support the second goal of this research, which is to develop joint embedding models for formulae and text that can support the full range of content encountered in math retrieval. This can be seen as a special case of multimodal embedding, and it may therefore benefit from related research that jointly models other settings in which text and structured representations are co-present, such as chemistry.

I summarize the research questions as follows:

RQ1: How can we effectively provide an embedding model for isolated mathematical formulae?
RQ2: How should the joint embedding of text and formulae be done?
RQ3: How can evaluation of math search be grounded in a representative task?

For RQ1, I propose to first study simple models that walk the tree structure, examining the effectiveness and efficiency of the resulting formula embeddings, and then to move to more advanced models. I have introduced the Tangent-CFT [2] model. As my next step for formula embedding, I plan to look at deep neural network models that have been applied to graph embedding.

After studying an embedding model for isolated formulae, in RQ2 I plan to focus on making use of the text surrounding formulae. I will consider four possible approaches to constructing a joint embedding model:

1. Linearizing the tree structure of formulae to sequences and then applying a single sequence embedding model to the text and the linearized formula, similar to [1] (see the first sketch after the abstract);
2. Forming separate embeddings for text and formulae, then unifying the two embedding spaces using seed alignments obtained either through supervision or through heuristics;
3. Extracting a tree from the text and then applying a structure embedding model to both trees; or
4. Combining results from specialized embedding models. For example, if the task is retrieval (ranking), then in the simplest scenario the results can be combined with methods such as Reciprocal Rank Fusion (RRF) or CombMNZ (see the second sketch after the abstract).

I would then study how text and formula embedding models should be combined. One possible solution is to do retrieval using each of the embeddings and then combine the results. Another is to learn a model that provides a unified embedding capturing both formula and text features. A third, related to approach 3 above, is to convert text to a tree structure and treat joint embedding as a tree-to-tree translation problem. For both RQ1 and RQ2, I plan to first study the effectiveness of the proposed embeddings on formula retrieval before proceeding to the text+formula condition. Results will be compared with the best reported results on the ARQMath [3] question answering task.
While part of this research focuses on creating an embedding model for math, I also need a standard evaluation protocol and dataset. In a planned three-year sequence of ARQMath labs, I aim to answer RQ3 and provide high-quality training, devtest, and test sets for math search. Importantly, ARQMath also serves as a platform for operationalizing a repeatable community-consensus definition for relevance in isolated formula search.
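
As a concrete illustration of approach 1 above, the following is a minimal sketch of linearizing a formula tree into a token sequence that a generic sequence embedding model (for example, a word2vec- or fastText-style model) could then consume alongside text tokens. The toy (label, children) tree representation and the bracket tokens are illustrative assumptions made for this sketch, not the Symbol Layout Tree and Operator Tree tuples used by Tangent-CFT [2].

    def linearize_formula_tree(node):
        """Depth-first linearization of a formula tree into tokens.

        `node` is a (label, children) pair, a toy stand-in for a parsed
        formula tree; a real system would parse Presentation or Content
        MathML instead. The resulting token sequence can be fed to any
        off-the-shelf sequence embedding model together with text tokens.
        """
        label, children = node
        tokens = [label]
        for child in children:
            tokens.append("(")          # mark descent into a subtree
            tokens.extend(linearize_formula_tree(child))
            tokens.append(")")          # mark return to the parent
        return tokens

    # x^2 + y as a toy operator tree: +( ^(x, 2), y )
    tree = ("+", [("^", [("x", []), ("2", [])]), ("y", [])])
    print(" ".join(linearize_formula_tree(tree)))
    # -> + ( ^ ( x ) ( 2 ) ) ( y )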
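For approach 4, Reciprocal Rank Fusion is simple enough to sketch directly. The constant k = 60 is the value commonly used in the literature, and the run names and post identifiers below are invented for illustration; they are not tied to any particular ARQMath run.

    from collections import defaultdict

    def reciprocal_rank_fusion(ranked_lists, k=60):
        """Fuse several ranked result lists with Reciprocal Rank Fusion.

        ranked_lists: lists of document ids, each ordered best-first,
        e.g. one list from a formula-embedding retriever and one from a
        text retriever. Each document scores sum_i 1 / (k + rank_i).
        """
        scores = defaultdict(float)
        for results in ranked_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        # Return document ids sorted by fused score, best first.
        return sorted(scores, key=scores.get, reverse=True)

    # Example: combine a formula-based and a text-based ranking.
    formula_run = ["post_12", "post_7", "post_33"]
    text_run = ["post_7", "post_45", "post_12"]
    print(reciprocal_rank_fusion([formula_run, text_run]))

CombMNZ would instead sum normalized retrieval scores and multiply by the number of runs in which a document appears, so it rewards documents retrieved by both the text and the formula model.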

References

  1. Kriste Krstovski and David M. Blei. 2018. Equation embeddings. arXiv preprint arXiv:1803.09123 (2018).
  2. Behrooz Mansouri, Shaurya Rohatgi, Douglas W. Oard, Jian Wu, C. Lee Giles, and Richard Zanibbi. 2019. Tangent-CFT: An embedding model for mathematical formulas. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval. ACM.
  3. Richard Zanibbi, Douglas W. Oard, Anurag Agarwal, and Behrooz Mansouri. 2020. Overview of ARQMath 2020: CLEF lab on answer retrieval for questions on math. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the 11th International Conference of the CLEF Association. Springer.

Published in:
SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2021, 2998 pages. ISBN: 9781450380379. DOI: 10.1145/3404835.
Copyright © 2021 Owner/Author.


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
