Abstract
Since OpenAI opened public access to ChatGPT, large language models (LLMs) have become an increasingly popular topic, attracting researchers’ attention from a wide range of domains. However, public researchers encounter obstacles when developing LLMs, given that most LLMs are produced by industry and their training details are typically undisclosed. Since datasets are an essential part of the LLM training setup, this paper presents a holistic survey of the training datasets used in both the pre-training and fine-tuning processes. The paper first summarizes 16 pre-training datasets and 16 fine-tuning datasets used in state-of-the-art LLMs. Second, based on the properties of the pre-training and fine-tuning processes, it comments on pre-training datasets in terms of quality, quantity, and their relation to models, and on fine-tuning datasets in terms of quality, quantity, and associated concerns. The study then critically identifies the problems and research trends in current LLM datasets. It helps public researchers train and investigate LLMs through illustrative cases and offers useful comments to the research community on data development. To the best of our knowledge, this paper is the first to summarize and discuss the datasets used in both autoregressive and chat LLMs. The survey offers insights and suggestions to researchers and LLM developers as they build their models, and contributes to LLM research by pointing out the existing problems of LLM studies from the perspective of data.
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Fei Du received her B.S. degrees in mathematical science and data science from the University of California, Santa Barbara. She is currently a master’s student at Columbia University in New York City, and has interned at the National Key Laboratory of Data Space Technology and System and the Advanced Institute of Big Data, Beijing. Her research interests include machine learning, NLP, deep learning, and statistical data analysis.
Xin-Jian Ma received his Ph.D. degree in information security from the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, in 2017. He is currently an associate researcher at Advanced Institute of Big Data, Beijing. His research interests include big data, distributed system, and information security.
Jing-Ru Yang received her Ph.D. degree in computer science and technology from Renmin University of China, Beijing. She completed her postdoctoral research with funding from the Boya Program at Peking University, Beijing, and is currently a research associate at the National Key Laboratory of Dataspace Technology and System, Beijing. Her research focuses on data governance technology and systems.
Yi Liu received his Ph.D. degree in software engineering from Peking University, Beijing, in 2019. He is currently an associate researcher at Advanced Institute of Big Data, Beijing. His research interests include serverless computing and service computing.
Chao-Ran Luo received his Ph.D. degree in software engineering from Peking University, Beijing. He is currently a research associate at Advanced Institute of Big Data, Beijing. His research focuses on data space, internet of data, and digital object architecture.
Xue-Bin Wang received his Ph.D. degree in computer science from National University of Defence Technology, Changsha, in 2007. He is currently a senior engineer at Advanced Institute of Big Data, Beijing. His research interests are big data and artificial intelligence.
Hai-Ou Jiang received her Ph.D. degree in computer science from Beijing University of Posts and Telecommunications, Beijing, in 2017. She is currently a research associate at Advanced Institute of Big Data, Beijing. Her research interests include cloud computing, big data, and machine learning.
Xiang Jing received his Ph.D. degree in information security from the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, in 2016. He completed his postdoctoral research in computer theory at the Software Engineering Institute, Peking University, Beijing. He is currently a research associate in the School of Software and Microelectronics at Peking University, Beijing. His research focuses on operating systems, big data, blockchain technology, and the industrial internet.
Cite this article
Du, F., Ma, XJ., Yang, JR. et al. A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot. J. Comput. Sci. Technol. 39, 542–566 (2024). https://doi.org/10.1007/s11390-024-3767-3