Abstract
Large language models (LLMs) have shown remarkable capabilities in understanding and generating language across a wide range of domains. However, their performance on advanced arithmetic calculation remains a significant challenge, especially for small-size LLMs. Therefore, in this paper, we propose ArithmeticGPT, a practical framework designed to enhance the advanced arithmetic skills of small-size LLMs. We carefully curate an arithmetic instruction dataset, ArithInstruct, that teaches small-size LLMs to trigger a self-developed internal calculation API for precise computations without explicit instructions, so that advanced arithmetic results are seamlessly generated within natural language sentences. Furthermore, we empirically design a practical three-stage strategy for fine-tuning small-size LLMs with ArithInstruct that enables these advanced arithmetic skills while preserving the models' original abilities such as commonsense reasoning and question answering. We evaluate ArithmeticGPT on six public math-related datasets against 17 state-of-the-art LLM baselines, and the experimental results demonstrate the superiority of our approach. To encourage reproducible research, we make our data and code publicly available at https://github.com/ai4ed/ArithmeticGPT.
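For intuition, the sketch below illustrates, under assumed conventions, how a tagged calculation call emitted in a model's output could be routed to an internal calculator and its exact result spliced back into the natural-language sentence. The `<calc>` tag format, the `calc_api` function, and the operator whitelist are illustrative assumptions, not ArithmeticGPT's actual interface.

```python
# Minimal, illustrative sketch (assumptions, not the paper's implementation):
# the model emits a tagged expression such as <calc>17.5 * 24</calc>, which is
# evaluated by a restricted arithmetic evaluator and inlined into the answer.
import ast
import operator
import re

# Whitelisted operators for safe arithmetic evaluation.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def _eval_node(node):
    """Recursively evaluate a parsed arithmetic expression tree."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")

def calc_api(expression: str) -> str:
    """Hypothetical internal calculation API: exact arithmetic on one expression."""
    result = _eval_node(ast.parse(expression, mode="eval").body)
    return f"{result:g}"

def fill_calculations(generated_text: str) -> str:
    """Replace every <calc>...</calc> span in model output with its computed value."""
    return re.sub(r"<calc>(.+?)</calc>",
                  lambda m: calc_api(m.group(1)),
                  generated_text)

if __name__ == "__main__":
    draft = "The total cost is <calc>17.5 * 24</calc> dollars."
    print(fill_calculations(draft))  # -> "The total cost is 420 dollars."
```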




Data availability
No datasets were generated or analysed during the current study.
Notes
For example, there are many different expressions indicating the value of 1,000, e.g., 1K, a thousand, and the Chinese 一千 and 千.
Acknowledgments
This work was supported in part by the National Key R&D Program of China under Grant No. 2022YFC3303600, and in part by the Key Laboratory of Smart Education of Guangdong Higher Education Institutes, Jinan University (2022LSYS003).
Author information
Contributions
A.B. and C.D.E. wrote the main manuscript text, and F.G. conducted the experiments and prepared Figures 1–4. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, Z., Zheng, Y., Yin, Z. et al. ArithmeticGPT: empowering small-size large language models with advanced arithmetic skills. Mach Learn 114, 24 (2025). https://doi.org/10.1007/s10994-024-06681-1