
ArithmeticGPT: empowering small-size large language models with advanced arithmetic skills

Published in: Machine Learning

Abstract

Large language models (LLMs) have shown remarkable capabilities in understanding and generating language across a wide range of domains. However, their performance in advanced arithmetic calculation remains a significant challenge, especially for small-size LLMs. Therefore, in this paper, we propose ArithmeticGPT, a practical framework designed to enhance the advanced arithmetic skills of small-size LLMs. We carefully curate an arithmetic instruction dataset, ArithInstruct, that teaches small-size LLMs to trigger a self-developed internal calculation API for precise computations without explicit instructions, so that advanced arithmetic results are seamlessly generated within natural language sentences. Furthermore, we empirically design a practical three-stage strategy for fine-tuning small-size LLMs with ArithInstruct to enable the advanced arithmetic skills while preserving the models' original abilities such as commonsense reasoning and question answering. We evaluate ArithmeticGPT on six public math-related datasets against 17 state-of-the-art LLM baselines, and the experimental results demonstrate the superiority of our approach. To encourage reproducible research, we make our data and code publicly available at https://github.com/ai4ed/ArithmeticGPT.
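
The exact trigger format used by ArithInstruct is not reproduced here. As a minimal sketch of the general mechanism, assuming a hypothetical <calc>...</calc> tag emitted by the model, an illustrative safe_eval helper, and post-hoc substitution (none of which are claimed to match the paper's implementation), the value of an inline calculation call can be computed and spliced back into the generated sentence as follows:

  import ast
  import operator
  import re

  # Hypothetical inline tag; the actual ArithInstruct trigger syntax may differ.
  CALC_PATTERN = re.compile(r"<calc>(.*?)</calc>", re.DOTALL)

  # Whitelisted operators for a small, safe arithmetic evaluator (no eval()).
  _OPS = {
      ast.Add: operator.add, ast.Sub: operator.sub,
      ast.Mult: operator.mul, ast.Div: operator.truediv,
      ast.Pow: operator.pow, ast.USub: operator.neg,
  }

  def safe_eval(expression: str):
      """Evaluate a pure arithmetic expression by walking its AST."""
      def walk(node):
          if isinstance(node, ast.Expression):
              return walk(node.body)
          if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
              return node.value
          if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
              return _OPS[type(node.op)](walk(node.left), walk(node.right))
          if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
              return _OPS[type(node.op)](walk(node.operand))
          raise ValueError(f"unsupported expression: {expression!r}")
      return walk(ast.parse(expression, mode="eval"))

  def resolve_calls(generated_text: str) -> str:
      """Replace every <calc>...</calc> span with its computed value."""
      return CALC_PATTERN.sub(lambda m: str(safe_eval(m.group(1))), generated_text)

  # The model emits the call inline; the wrapper splices the precise result in.
  print(resolve_calls("The total cost is <calc>17 * 365 + 99</calc> dollars."))
  # -> The total cost is 6304 dollars.

In the full pipeline described in the abstract, this substitution happens during or immediately after decoding, so the user only ever sees the completed natural language sentence.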




Data availability

No datasets were generated or analysed during the current study.

Notes

  1. For example, there are many different expressions indicating the value of 1,000, e.g., 1K, a thousand, 一千 (one thousand), 千 (thousand), etc.

  2. https://crfm.stanford.edu/2023/03/13/alpaca.html.

  3. https://matheval.ai/


Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant No. 2022YFC3303600, and in part by the Key Laboratory of Smart Education of Guangdong Higher Education Institutes, Jinan University (2022LSYS003).

Author information

Authors and Affiliations

Authors

Contributions

A.B. and C.D.E. wrote the main manuscript text, and F.G. conducted the experiments and prepared Figures 1-4. All authors reviewed the manuscript.

Corresponding author

Correspondence to Zitao Liu.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, Z., Zheng, Y., Yin, Z. et al. ArithmeticGPT: empowering small-size large language models with advanced arithmetic skills. Mach Learn 114, 24 (2025). https://doi.org/10.1007/s10994-024-06681-1


  • DOI: https://doi.org/10.1007/s10994-024-06681-1
