Math Information Retrieval with Contrastive Learning of Formula Embeddings

  • Conference paper
Web Information Systems Engineering – WISE 2023 (WISE 2023)

Abstract

Formula retrieval is the core, and the hardest, part of Mathematical Information Retrieval (MathIR). The datasets used for formula retrieval are usually scientific documents that contain formulas, but labeled datasets built specifically for formula similarity are lacking. Contrastive learning can learn general features from unlabeled data and autonomously discover latent structures in it. In addition, dense retrieval methods based on bi-encoders have gained increasing attention. We therefore propose CLFE, a simple framework for contrastive learning of formula embeddings. It learns the latent structure and content information of formulas from unlabeled formulas and generates formula embeddings for formula retrieval. We design two frameworks, Contrastive Presentation MathML-Content MathML Learning and Contrastive LaTeX-MathML Learning, which initialize their encoders from transformer-based models and produce superior LaTeX, Presentation MathML, and Content MathML embeddings for each formula. Combining these three embeddings yields the Add embedding and the Concat embedding. Formula retrieval is then performed by finding the k nearest formulas in the vector space through nearest neighbor search. Experimental results show that, when our method is applied to the NTCIR-12 dataset to retrieve 20 non-wildcard formulas and the top-10 results are scored, the highest mean P@10, mean nDCG@10, and MAP@10 are 0.6250, 0.9792, and 0.9381, respectively.
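The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the Add embedding is an element-wise sum and the Concat embedding a concatenation of the LaTeX, Presentation MathML, and Content MathML vectors, that similarity is cosine similarity, and that the search is brute force. All function names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def add_embedding(latex, pmml, cmml):
    """Assumed 'Add' combination: element-wise sum of equal-length vectors."""
    return [a + b + c for a, b, c in zip(latex, pmml, cmml)]


def concat_embedding(latex, pmml, cmml):
    """Assumed 'Concat' combination: the three vectors placed end to end."""
    return list(latex) + list(pmml) + list(cmml)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def knn_search(query, index, k):
    """Brute-force search: ids of the k formulas most similar to the query.

    `index` maps a formula id to its embedding.
    """
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [formula_id for formula_id, _ in ranked[:k]]


def precision_at_k(ranked_ids, relevant_ids, k):
    """P@k: fraction of the top-k retrieved formulas that are relevant."""
    return sum(1 for fid in ranked_ids[:k] if fid in relevant_ids) / k


# Toy 2-dimensional embeddings (made up for illustration only).
index = {
    "f1": add_embedding([1, 0], [1, 0], [1, 0]),
    "f2": add_embedding([0, 1], [0, 1], [0, 1]),
    "f3": add_embedding([1, 1], [1, 0], [1, 0]),
}
query = add_embedding([1, 0], [1, 0], [1, 0.1])

top2 = knn_search(query, index, k=2)
print(top2)                                # ['f1', 'f3']
print(precision_at_k(top2, {"f1"}, k=2))   # 0.5
```

A collection the size of NTCIR-12 would normally call for an approximate nearest neighbor index rather than a full sort; the brute-force version is used here only to keep the example self-contained.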

Acknowledgment

This work is supported by the Natural Science Foundation of Hebei Province of China under Grant F2019201329.

Author information

Corresponding author

Correspondence to Xuedong Tian.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wang, J., Tian, X. (2023). Math Information Retrieval with Contrastive Learning of Formula Embeddings. In: Zhang, F., Wang, H., Barhamgi, M., Chen, L., Zhou, R. (eds) Web Information Systems Engineering – WISE 2023. WISE 2023. Lecture Notes in Computer Science, vol 14306. Springer, Singapore. https://doi.org/10.1007/978-981-99-7254-8_8

  • DOI: https://doi.org/10.1007/978-981-99-7254-8_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7253-1

  • Online ISBN: 978-981-99-7254-8

  • eBook Packages: Computer Science, Computer Science (R0)
