
A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot

  • Survey
  • Published:
Journal of Computer Science and Technology

Abstract

Since OpenAI opened access to ChatGPT, large language models (LLMs) have become an increasingly popular topic, attracting attention from researchers across many domains. However, researchers outside industry face difficulties when developing LLMs, because most LLMs are produced by industry and their training details are typically undisclosed. Since datasets are a key ingredient of LLM training, this paper presents a holistic survey of the training datasets used in both the pre-train and fine-tune processes. The paper first summarizes 16 pre-train datasets and 16 fine-tune datasets used in state-of-the-art LLMs. Secondly, based on the properties of the pre-train and fine-tune processes, it discusses pre-train datasets in terms of quality, quantity, and their relation to models, and fine-tune datasets in terms of quality, quantity, and remaining concerns. The study then critically identifies the problems and research trends in current LLM datasets. It helps researchers train and investigate LLMs through illustrative cases and provides the research community with useful observations on data development. To the best of our knowledge, this paper is the first to summarize and discuss the datasets used in both autoregressive and chat LLMs. The survey offers insights and suggestions to researchers and LLM developers as they build their models, and contributes to LLM research by pointing out the existing problems of LLM studies from the perspective of data.



Conflict of Interest

The authors declare that they have no conflict of interest.

Author information

Corresponding authors

Correspondence to Xin-Jian Ma  (马新建) or Xiang Jing  (景 翔).

Additional information

Fei Du received her B.S. degrees in mathematical science and data science from the University of California, Santa Barbara. She is currently a master's student at Columbia University in New York City, and has interned at the National Key Laboratory of Data Space Technology and System and the Advanced Institute of Big Data, Beijing. Her research interests include machine learning, NLP, deep learning, and statistical data analysis.

Xin-Jian Ma received his Ph.D. degree in information security from the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, in 2017. He is currently an associate researcher at Advanced Institute of Big Data, Beijing. His research interests include big data, distributed systems, and information security.

Jing-Ru Yang received her Ph.D. degree in computer science and technology from Renmin University of China, Beijing. She completed her postdoctoral research with funding from the Boya Program at Peking University, Beijing, and is currently a research associate of the National Key Laboratory of Dataspace Technology and System, Beijing. Her research focuses on data governance technology and systems.

Yi Liu received his Ph.D. degree in software engineering from Peking University, Beijing, in 2019. He is currently an associate researcher at Advanced Institute of Big Data, Beijing. His research interests include serverless computing and service computing.

Chao-Ran Luo received his Ph.D. degree in software engineering from Peking University, Beijing. He is currently a research associate at Advanced Institute of Big Data, Beijing. His research focuses on data space, internet of data, and digital object architecture.

Xue-Bin Wang received his Ph.D. degree in computer science from National University of Defence Technology, Changsha, in 2007. He is currently a senior engineer at Advanced Institute of Big Data, Beijing. His research interests are big data and artificial intelligence.

Hai-Ou Jiang received her Ph.D. degree in computer science from Beijing University of Posts and Telecommunications, Beijing, in 2017. She is currently a research associate at Advanced Institute of Big Data, Beijing. Her research interests include cloud computing, big data, and machine learning.

Xiang Jing received his Ph.D. degree in information security from the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, in 2016. He completed his postdoctoral research in computer theory at the Software Engineering Institute of Peking University, Beijing. He is currently a research associate in the School of Software and Microelectronics at Peking University, Beijing. His research focuses on operating systems, big data, blockchain technology, and the industrial internet.



About this article


Cite this article

Du, F., Ma, XJ., Yang, JR. et al. A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot. J. Comput. Sci. Technol. 39, 542–566 (2024). https://doi.org/10.1007/s11390-024-3767-3


  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-024-3767-3

Keywords