Abstract
Formula retrieval is the core and most difficult part of Mathematical Information Retrieval (MathIR). The datasets used for formula retrieval are usually scientific documents that contain formulas, but labeled datasets built specifically for formula similarity are lacking. Contrastive learning can learn general features from unlabeled data and autonomously discover latent structure in it. In addition, dense retrieval methods based on bi-encoders have attracted increasing attention. We therefore propose CLFE, a simple framework for contrastive learning of formula embeddings. It learns the latent structure and content information of formulas from unlabeled formulas and generates formula embeddings for formula retrieval. We design two frameworks, Contrastive Presentation MathML-Content MathML Learning and Contrastive LaTeX-MathML Learning, both of which initialize their encoders from transformer-based models and produce high-quality LaTeX, Presentation MathML, and Content MathML embeddings for each formula. Combining these three embeddings yields the Add embedding and the Concat embedding. Formula retrieval is then performed by nearest neighbor search, which finds the k nearest formulas in the vector space. Experimental results show that when our method is applied to the NTCIR-12 dataset to retrieve 20 non-wildcard formulas and the top-10 results are scored, the highest scores achieved for mean P@10, mean nDCG@10, and MAP@10 are 0.6250, 0.9792, and 0.9381, respectively.
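The combination and retrieval steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not define the Add and Concat embeddings, so the sketch assumes an element-wise sum and a concatenation of L2-normalized vectors, and it assumes cosine similarity for the nearest neighbor search. The function names (`l2_normalize`, `combine`, `top_k`) and the random stand-in embeddings are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def combine(latex, pmml, cmml, mode="add"):
    """Combine three per-formula embeddings (assumed definitions).

    mode="add":    element-wise sum; all inputs must share one dimension.
    mode="concat": concatenation along the feature axis.
    """
    parts = [l2_normalize(np.asarray(p, dtype=np.float64))
             for p in (latex, pmml, cmml)]
    if mode == "add":
        return np.sum(parts, axis=0)
    if mode == "concat":
        return np.concatenate(parts, axis=-1)
    raise ValueError(f"unknown mode: {mode!r}")


def top_k(query, corpus, k=10):
    """Return indices and cosine scores of the k nearest corpus vectors."""
    q = l2_normalize(np.asarray(query, dtype=np.float64))
    c = l2_normalize(np.asarray(corpus, dtype=np.float64))
    scores = c @ q                      # cosine similarity per corpus row
    k = min(k, scores.shape[0])
    order = np.argsort(-scores, kind="stable")[:k]
    return order, scores[order]


# Random vectors stand in for encoder outputs (100 formulas, dimension 8).
rng = np.random.default_rng(0)
latex_emb = rng.normal(size=(100, 8))
pmml_emb = rng.normal(size=(100, 8))
cmml_emb = rng.normal(size=(100, 8))

add_emb = combine(latex_emb, pmml_emb, cmml_emb, mode="add")
concat_emb = combine(latex_emb, pmml_emb, cmml_emb, mode="concat")

# Query with formula 7 itself; it should be its own nearest neighbor.
add_idx, add_scores = top_k(add_emb[7], add_emb, k=10)
concat_idx, concat_scores = top_k(concat_emb[7], concat_emb, k=10)
```

A brute-force scan like this is adequate for small collections. For a larger formula corpus, an approximate nearest neighbor index would be the usual substitute, which the abstract does not specify.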
Acknowledgment
This work is supported by the Natural Science Foundation of Hebei Province of China under Grant F2019201329.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, J., Tian, X. (2023). Math Information Retrieval with Contrastive Learning of Formula Embeddings. In: Zhang, F., Wang, H., Barhamgi, M., Chen, L., Zhou, R. (eds) Web Information Systems Engineering – WISE 2023. WISE 2023. Lecture Notes in Computer Science, vol 14306. Springer, Singapore. https://doi.org/10.1007/978-981-99-7254-8_8
DOI: https://doi.org/10.1007/978-981-99-7254-8_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7253-1
Online ISBN: 978-981-99-7254-8
eBook Packages: Computer Science, Computer Science (R0)