Abstract
Large language models (LLMs) have shown remarkable capabilities in understanding and generating language across a wide range of domains. However, their performance on advanced arithmetic calculation remains a significant challenge, especially for small-size LLMs. Therefore, in this paper, we propose ArithmeticGPT, a practical framework designed to enhance the advanced arithmetic skills of small-size LLMs. We carefully curate an arithmetic instruction dataset, ArithInstruct, that teaches small-size LLMs to trigger a self-developed internal calculation API for precise computations without explicit instructions, so that advanced arithmetic results are seamlessly generated within natural language sentences. Furthermore, we empirically design a practical three-stage strategy for fine-tuning small-size LLMs with ArithInstruct that enables these advanced arithmetic skills while preserving the models' original abilities such as commonsense reasoning and question answering. We evaluate ArithmeticGPT on six public math-related datasets against 17 state-of-the-art LLM baselines, and the experimental results demonstrate the superiority of our approach. To encourage reproducible research, we make our data and code publicly available at https://github.com/ai4ed/ArithmeticGPT.
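For intuition, the sketch below illustrates, under assumed conventions, how a tagged calculation call emitted in a model's output could be routed to an internal calculator and its exact result spliced back into the natural-language sentence. The `<calc>` tag format, the `calc_api` function, and the operator whitelist are illustrative assumptions, not ArithmeticGPT's actual interface.

```python
# Minimal, illustrative sketch (assumptions, not the paper's implementation):
# the model emits a tagged expression such as <calc>17.5 * 24</calc>, which is
# evaluated by a restricted arithmetic evaluator and inlined into the answer.
import ast
import operator
import re

# Whitelisted operators for safe arithmetic evaluation.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def _eval_node(node):
    """Recursively evaluate a parsed arithmetic expression tree."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")

def calc_api(expression: str) -> str:
    """Hypothetical internal calculation API: exact arithmetic on one expression."""
    result = _eval_node(ast.parse(expression, mode="eval").body)
    return f"{result:g}"

def fill_calculations(generated_text: str) -> str:
    """Replace every <calc>...</calc> span in model output with its computed value."""
    return re.sub(r"<calc>(.+?)</calc>",
                  lambda m: calc_api(m.group(1)),
                  generated_text)

if __name__ == "__main__":
    draft = "The total cost is <calc>17.5 * 24</calc> dollars."
    print(fill_calculations(draft))  # -> "The total cost is 420 dollars."
```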




Data availability
No datasets were generated or analysed during the current study.
Notes
For example, there are many different expressions indicating the value of 1,000, e.g., 1K, a thousand, and the Chinese 一千 and 千.
Acknowledgments
This work was supported in part by the National Key R&D Program of China under Grant No. 2022YFC3303600, and in part by the Key Laboratory of Smart Education of Guangdong Higher Education Institutes, Jinan University (2022LSYS003).
Author information
Contributions
A.B. and C.D.E. wrote the main manuscript text, and F.G. conducted the experiments and prepared Figures 1–4. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, Z., Zheng, Y., Yin, Z. et al. ArithmeticGPT: empowering small-size large language models with advanced arithmetic skills. Mach Learn 114, 24 (2025). https://doi.org/10.1007/s10994-024-06681-1