DOI: 10.1145/3539618.3591746
Research article

One Blade for One Purpose: Advancing Math Information Retrieval using Hybrid Search

Published: 18 July 2023

Abstract

Neural retrievers have been shown to be effective for math-aware search. Their ability to cope with math symbol mismatches, to represent highly contextualized semantics, and to learn effective representations is critical to improving math information retrieval. However, the most effective retriever for math remains impractical, as it depends on token-level dense representations for each math token, which leads to prohibitive storage demands, especially considering that math content generally consumes more tokens. In this work, we try to alleviate this efficiency bottleneck while boosting math information retrieval effectiveness via hybrid search. To this end, we propose MABOWDOR, a Math-Aware Best-of-Worlds Domain Optimized Retriever, which has an unsupervised structure search component, a dense retriever, and optionally a sparse retriever on top of a domain-adapted backbone learned by context-enhanced pretraining, each addressing a different need in retrieving heterogeneous data from math documents. Our hybrid search outperforms the previous state-of-the-art math IR system while eliminating efficiency bottlenecks. Our system is available at https://github.com/approach0/pya0.
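The hybrid search the abstract describes combines rankings from several retrievers. As an illustration only, the sketch below fuses a sparse score list and a dense score list by linear interpolation after min-max normalization, one common convex-combination fusion for hybrid retrieval. The function names, the interpolation weight `alpha`, and the toy scores are assumptions for this sketch, not MABOWDOR's actual fusion method, and the paper's unsupervised structure search component is not modeled here.

```python
def minmax(scores):
    """Min-max normalize a dict of doc_id -> score into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    return {doc: (s - lo) / span for doc, s in scores.items()}


def hybrid_fuse(sparse, dense, alpha=0.5):
    """Fuse sparse and dense retrieval scores by convex combination.

    Documents absent from one ranked list contribute 0 from that list.
    Returns (doc_id, fused_score) pairs, best first.
    """
    s, d = minmax(sparse), minmax(dense)
    fused = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
             for doc in set(s) | set(d)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Normalizing first puts BM25-scale scores (unbounded) and dense similarity scores (typically bounded) on a comparable footing before interpolation; without it, one retriever's score range would dominate the sum.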

Supplemental Material

MP4 File




Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023, 3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618

Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. hybrid search
  2. math information retrieval
  3. neural retriever

Qualifiers

  • Research-article


Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions, 20%


Article Metrics

  • Downloads (last 12 months): 143
  • Downloads (last 6 weeks): 7
Reflects downloads up to 05 Mar 2025


Cited By

  • (2025) "Advances in Vector Search." In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, 995–997. DOI: 10.1145/3701551.3703482. Online publication date: 10-Mar-2025.
  • (2024) "Mathematical Information Retrieval: A Review." ACM Computing Surveys 57, 3, 1–34. DOI: 10.1145/3699953. Online publication date: 11-Nov-2024.
  • (2024) "Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange." In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2316–2320. DOI: 10.1145/3626772.3657945. Online publication date: 10-Jul-2024.
  • (2024) "The Effectiveness of Graph Contrastive Learning on Mathematical Information Retrieval." In Advances on Graph-Based Approaches in Information Retrieval, 60–72. DOI: 10.1007/978-3-031-71382-8_5. Online publication date: 10-Oct-2024.
  • (2024) "Taxonomy of Mathematical Plagiarism." In Advances in Information Retrieval, 12–20. DOI: 10.1007/978-3-031-56066-8_2. Online publication date: 24-Mar-2024.
  • (2024) "Investigating the Usage of Formulae in Mathematical Answer Retrieval." In Advances in Information Retrieval, 247–261. DOI: 10.1007/978-3-031-56027-9_15. Online publication date: 24-Mar-2024.
  • (2023) "Answer Retrieval for Math Questions Using Structural and Dense Retrieval." In Experimental IR Meets Multilinguality, Multimodality, and Interaction, 209–223. DOI: 10.1007/978-3-031-42448-9_18. Online publication date: 18-Sep-2023.
