DOI: 10.1145/3404835.3463264

Embedding Formulae and Text for Improved Math Retrieval

Published: 11 July 2021

ABSTRACT

Large data collections containing millions of math formulae are available online. Retrieving math expressions from these collections is challenging: the structural complexity of formulae requires specialized processing. When searching for mathematical content, accurate measures of formula similarity can help with tasks such as document ranking, query recommendation, and result set clustering. While there have been many attempts at embedding words and graphs, formula embedding is still in its early stages. This research aims to introduce an embedding model for mathematical formulae and accompanying text that can be used in math information retrieval. To that end, embedding models for isolated formulae are introduced first, using intrinsic measures to study the effectiveness and efficiency of retrieval with those embeddings. Those results support the second goal of this research, which is to develop joint embedding models for formulae and text that can support the full range of content encountered in math retrieval. This can be seen as a special case of multimodal embedding, and it may therefore benefit from related research that jointly models other settings in which text and structured representations are co-present, such as chemistry.

I summarize the research questions as follows:

RQ1: How can we effectively provide an embedding model for isolated mathematical formulae?
RQ2: How should the joint embedding of text and formulae be done?
RQ3: How can evaluation of math search be grounded in a representative task?

For RQ1, I propose to first study simple models that walk the tree structure, examining the effectiveness and efficiency of the resulting formula embeddings, and then to move to more advanced models. I have introduced the Tangent-CFT [2] model. As my next step for formula embedding, I plan to look at deep neural network models that have been applied to graph embedding.

After studying an embedding model for isolated formulae, in RQ2 I plan to focus on making use of the text surrounding formulae. I will consider four possible approaches to constructing a joint embedding model:

1. Linearizing the tree structure of formulae to sequences and then applying a single sequence embedding model to the text and the linearized formula, similar to [1] (see the first sketch after the abstract);
2. Forming separate embeddings for text and formulae, then unifying the two embedding spaces using seed alignments obtained either through supervision or through heuristics;
3. Extracting a tree from the text and then applying a structure embedding model to both trees; or
4. Combining results from specialized embedding models. For example, if the task is retrieval (ranking), then in the simplest scenario the results can be combined with methods such as Reciprocal Rank Fusion (RRF) or CombMNZ (see the second sketch after the abstract).

I would then study how text and formula embedding models should be combined. One possible solution is to do retrieval using each of the embeddings and then combine the results. Another is to learn a model that provides a unified embedding capturing both formula and text features. A third, related to approach 3 above, is to convert text to a tree structure and treat joint embedding as a tree-to-tree translation problem. For both RQ1 and RQ2, I plan to first study the effectiveness of the proposed embeddings on formula retrieval before proceeding to the text+formula condition. Results will be compared with the best reported results on the ARQMath [3] question answering task.
While part of this research focuses on creating an embedding model for math, I also need a standard evaluation protocol and dataset. In a planned three-year sequence of ARQMath labs, I aim to answer RQ3 and provide high-quality training, devtest, and test sets for math search. Importantly, ARQMath also serves as a platform for operationalizing a repeatable community-consensus definition for relevance in isolated formula search.
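
As a concrete illustration of approach 1 above, the following is a minimal sketch of linearizing a formula tree into a token sequence that a generic sequence embedding model (for example, a word2vec- or fastText-style model) could then consume alongside text tokens. The toy (label, children) tree representation and the bracket tokens are illustrative assumptions made for this sketch, not the Symbol Layout Tree and Operator Tree tuples used by Tangent-CFT [2].

    def linearize_formula_tree(node):
        """Depth-first linearization of a formula tree into tokens.

        `node` is a (label, children) pair, a toy stand-in for a parsed
        formula tree; a real system would parse Presentation or Content
        MathML instead. The resulting token sequence can be fed to any
        off-the-shelf sequence embedding model together with text tokens.
        """
        label, children = node
        tokens = [label]
        for child in children:
            tokens.append("(")          # mark descent into a subtree
            tokens.extend(linearize_formula_tree(child))
            tokens.append(")")          # mark return to the parent
        return tokens

    # x^2 + y as a toy operator tree: +( ^(x, 2), y )
    tree = ("+", [("^", [("x", []), ("2", [])]), ("y", [])])
    print(" ".join(linearize_formula_tree(tree)))
    # -> + ( ^ ( x ) ( 2 ) ) ( y )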
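For approach 4, Reciprocal Rank Fusion is simple enough to sketch directly. The constant k = 60 is the value commonly used in the literature, and the run names and post identifiers below are invented for illustration; they are not tied to any particular ARQMath run.

    from collections import defaultdict

    def reciprocal_rank_fusion(ranked_lists, k=60):
        """Fuse several ranked result lists with Reciprocal Rank Fusion.

        ranked_lists: lists of document ids, each ordered best-first,
        e.g. one list from a formula-embedding retriever and one from a
        text retriever. Each document scores sum_i 1 / (k + rank_i).
        """
        scores = defaultdict(float)
        for results in ranked_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        # Return document ids sorted by fused score, best first.
        return sorted(scores, key=scores.get, reverse=True)

    # Example: combine a formula-based and a text-based ranking.
    formula_run = ["post_12", "post_7", "post_33"]
    text_run = ["post_7", "post_45", "post_12"]
    print(reciprocal_rank_fusion([formula_run, text_run]))

CombMNZ would instead sum normalized retrieval scores and multiply by the number of runs in which a document appears, so it rewards documents retrieved by both the text and the formula model.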

References

  1. Kriste Krstovski and David M. Blei. 2018. Equation embeddings. arXiv preprint arXiv:1803.09123 (2018).
  2. Behrooz Mansouri, Shaurya Rohatgi, Douglas W. Oard, Jian Wu, C. Lee Giles, and Richard Zanibbi. 2019. Tangent-CFT: An embedding model for mathematical formulas. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval. ACM.
  3. Richard Zanibbi, Douglas W. Oard, Anurag Agarwal, and Behrooz Mansouri. 2020. Overview of ARQMath 2020: CLEF lab on answer retrieval for questions on math. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the 11th International Conference of the CLEF Association. Springer.

Published in:
SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2021, 2998 pages. ISBN: 9781450380379. DOI: 10.1145/3404835.
Copyright © 2021 Owner/Author.


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
