DOI: 10.1145/3404835.3462956
research-article

Learning to Rank for Mathematical Formula Retrieval

Published: 11 July 2021

Abstract

In Mathematical Information Retrieval (MIR), a formula can be used as a query to find similar formulae in documents. Because formulae are structurally complex, matching them requires specialized processing. A formula may be represented by its appearance, as a Symbol Layout Tree (SLT), or by its syntax, as an Operator Tree (OPT). Previous approaches to formula retrieval have used one or both of these representations, with unification to improve search results for inexact matches (e.g., allowing different variable names to match). On these representations, models for matching full expressions (trees), subexpressions, and paths have been used. More recently, embedding models have been used to represent formulae as vectors. In this paper, the effectiveness of retrieval models and formula representations is studied to identify their relative strengths and weaknesses. Then, a learning-to-rank model is proposed that uses SVM-rank over similarity scores from different formula retrieval models as features. Experiments on the ARQMath formula retrieval task show that the proposed learning-to-rank model is effective, producing new state-of-the-art results.
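
To make the learning-to-rank idea concrete, the following is a minimal sketch, not the authors' implementation and not the SVM-rank tool itself. It assumes each candidate formula is described by a vector of similarity scores from three hypothetical retrieval models (one SLT-based, one OPT-based, one embedding-based) and fits a linear scorer with a pairwise hinge loss, the same kind of objective a ranking SVM optimizes, using plain subgradient descent. All feature values, relevance labels, and hyperparameters are invented for illustration.

```python
import random

def train_pairwise_ranker(queries, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Fit a linear scorer w.x with a pairwise hinge loss (ranking-SVM style).

    queries: list of per-query candidate lists; each candidate is a tuple
             (feature_vector, relevance_label), higher label = more relevant.
    """
    rng = random.Random(seed)
    # Build (better, worse) preference pairs, only within the same query.
    pairs = []
    for candidates in queries:
        for xi, yi in candidates:
            for xj, yj in candidates:
                if yi > yj:
                    pairs.append((xi, xj))
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            # Subgradient step: L2 shrinkage always, hinge term if margin < 1.
            for k in range(dim):
                grad = reg * w[k] - (diff[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w

def rank(w, candidates):
    """Return candidate indices sorted by descending learned score."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])

# Toy data (assumed): features are [slt_sim, opt_sim, embedding_sim].
toy_queries = [
    [([0.9, 0.8, 0.7], 2), ([0.6, 0.9, 0.5], 1), ([0.2, 0.1, 0.3], 0)],
    [([0.8, 0.7, 0.9], 2), ([0.5, 0.4, 0.6], 1), ([0.3, 0.2, 0.1], 0)],
]
w = train_pairwise_ranker(toy_queries)
print("learned weights:", w)
print("ranking for query 0:", rank(w, [x for x, _ in toy_queries[0]]))
```

The learned weights indicate how much each retrieval model's similarity score contributes to the combined ranking. In the first toy query, the second candidate has the highest OPT similarity but is still ranked below the first, because the preference pairs reward agreement across all three scores.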

    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. formula retrieval
    2. learning to rank
    3. math information retrieval

    Qualifiers

    • Research-article

    Conference

    SIGIR '21
    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2024) Mathematical Information Retrieval: A Review. ACM Computing Surveys 57:3 (1-34). DOI: 10.1145/3699953. Online publication date: 9-Oct-2024
    • (2024) Matching Problem Statements to Editorials in Competitive Programming. 2024 IEEE International Conference on Advanced Learning Technologies (ICALT) (171-175). DOI: 10.1109/ICALT61570.2024.00056. Online publication date: 1-Jul-2024
    • (2023) A Scientific Document Retrieval and Reordering Method by Incorporating HFS and LSD. Applied Sciences 13:20 (11207). DOI: 10.3390/app132011207. Online publication date: 12-Oct-2023
    • (2023) Answer Retrieval for Math Questions Using Structural and Dense Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction (209-223). DOI: 10.1007/978-3-031-42448-9_18. Online publication date: 18-Sep-2023
    • (2022) Retrieval and Ranking of Combining Ontology and Content Attributes for Scientific Document. Entropy 24:6 (810). DOI: 10.3390/e24060810. Online publication date: 10-Jun-2022
    • (2022) MathUSE: Mathematical information retrieval system using universal sentence encoder model. Journal of Information Science 50:1 (66-84). DOI: 10.1177/01655515221077335. Online publication date: 4-Mar-2022
    • (2022) Contextualized Formula Search Using Math Abstract Meaning Representation. Proceedings of the 31st ACM International Conference on Information & Knowledge Management (4329-4333). DOI: 10.1145/3511808.3557567. Online publication date: 17-Oct-2022
    • (2022) Advancing Math-Aware Search: The ARQMath-3 Lab at CLEF 2022. Advances in Information Retrieval (408-415). DOI: 10.1007/978-3-030-99739-7_51. Online publication date: 10-Apr-2022
