Abstract
Since OpenAI opened public access to ChatGPT, large language models (LLMs) have become an increasingly popular topic, attracting researchers’ attention from a wide range of domains. However, public researchers encounter obstacles when developing LLMs, given that most LLMs are produced by industry and their training details are typically undisclosed. Since datasets are an essential part of the LLM training setup, this paper presents a holistic survey of the training datasets used in both the pre-training and fine-tuning processes. The paper first summarizes 16 pre-training datasets and 16 fine-tuning datasets used in state-of-the-art LLMs. Second, based on the properties of the pre-training and fine-tuning processes, it comments on pre-training datasets in terms of quality, quantity, and their relation to models, and on fine-tuning datasets in terms of quality, quantity, and associated concerns. The study then critically identifies the problems and research trends in current LLM datasets. It helps public researchers train and investigate LLMs through illustrative cases and offers useful comments to the research community on data development. To the best of our knowledge, this paper is the first to summarize and discuss the datasets used in both autoregressive and chat LLMs. The survey offers insights and suggestions to researchers and LLM developers as they build their models, and contributes to LLM research by pointing out the existing problems of LLM studies from the perspective of data.
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Fei Du received her B.S. degrees in mathematical science and data science from the University of California, Santa Barbara. She is currently a master’s student at Columbia University in New York City, and has interned at the National Key Laboratory of Data Space Technology and System and the Advanced Institute of Big Data, Beijing. Her research interests include machine learning, NLP, deep learning, and statistical data analysis.
Xin-Jian Ma received his Ph.D. degree in information security from the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, in 2017. He is currently an associate researcher at Advanced Institute of Big Data, Beijing. His research interests include big data, distributed system, and information security.
Jing-Ru Yang received her Ph.D. degree in computer science and technology from Renmin University of China, Beijing. She completed her postdoctoral research with funding from the Boya Program at Peking University, Beijing, and is currently a research associate at the National Key Laboratory of Dataspace Technology and System, Beijing. Her research focuses on data governance technology and systems.
Yi Liu received his Ph.D. degree in software engineering from Peking University, Beijing, in 2019. He is currently an associate researcher at Advanced Institute of Big Data, Beijing. His research interests include serverless computing and service computing.
Chao-Ran Luo received his Ph.D. degree in software engineering from Peking University, Beijing. He is currently a research associate at Advanced Institute of Big Data, Beijing. His research focuses on data space, internet of data, and digital object architecture.
Xue-Bin Wang received his Ph.D. degree in computer science from National University of Defence Technology, Changsha, in 2007. He is currently a senior engineer at Advanced Institute of Big Data, Beijing. His research interests are big data and artificial intelligence.
Hai-Ou Jiang received her Ph.D. degree in computer science from Beijing University of Posts and Telecommunications, Beijing, in 2017. She is currently a research associate at Advanced Institute of Big Data, Beijing. Her research interests include cloud computing, big data, and machine learning.
Xiang Jing received his Ph.D. degree in information security from the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, in 2016. He completed his postdoctoral research in computer theory at the Software Engineering Institute, Peking University, Beijing. He is currently a research associate in the School of Software and Microelectronics at Peking University, Beijing. His research focuses on operating systems, big data, blockchain technology, and the industrial internet.
Cite this article
Du, F., Ma, XJ., Yang, JR. et al. A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot. J. Comput. Sci. Technol. 39, 542–566 (2024). https://doi.org/10.1007/s11390-024-3767-3