Abstract
Large language models (LLMs) demonstrate significant capabilities on traditional natural language processing (NLP) tasks and many examinations. However, few evaluations target specific subjects in the Chinese educational context. This study, focusing on secondary-school physics and history, explores the potential and limitations of LLMs in Chinese education. Our contributions are as follows: we establish the PH dataset, which covers secondary-school physics and history in Chinese and comprises thousands of multiple-choice questions; we evaluate three prevalent LLMs (ChatGPT, GPT-3, and ChatGLM) on the PH dataset; we propose a new prompting method, One-More-Check (OMC), to enhance the logical-reasoning capacity of LLMs; finally, we have the three LLMs sit an actual secondary-school history exam. Our findings suggest that OMC improves LLM performance on logical reasoning, and that the LLMs underperform the average age-appropriate student on the history exam. All datasets, code, and evaluation results are available at https://github.com/hcffffff/PH-dataset-OMC.
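The paper itself details the OMC pipeline; as a rough illustration of the idea the name suggests, a self-checking prompt loop asks the model for an answer and then prompts it once more to verify that answer before committing. The sketch below is a hypothetical reading, not the authors' implementation: `ask_llm`, the prompt wording, and the single-retry policy are all illustrative assumptions.

```python
# Hedged sketch of a "check your answer one more time" prompting loop,
# loosely inspired by the One-More-Check (OMC) idea. ask_llm stands in
# for any chat-completion call (e.g. via the OpenAI API); its signature
# here is an assumption for illustration.

def one_more_check(ask_llm, question: str) -> str:
    """Ask a multiple-choice question, then prompt the model once more
    to re-examine its own answer before returning a final choice."""
    first = ask_llm(
        f"Question:\n{question}\nAnswer with a single option letter."
    )
    check_prompt = (
        f"Question:\n{question}\n"
        f"Your previous answer was: {first}\n"
        "Check the reasoning one more time. If the answer is wrong, "
        "reply with the corrected option letter; otherwise repeat it."
    )
    # The second pass's output is taken as the final answer.
    return ask_llm(check_prompt).strip()
```

With a real model behind `ask_llm`, the second pass gives the model a chance to catch a slip in its first-pass reasoning, which is the behavior the evaluation in this paper measures.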
Notes
1. C-EVAL official code: https://github.com/SJTU-LIT/ceval.
2. GAOKAO-Bench existing results: https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/result.
3. Use the OpenAI API at https://openai.com/api/.
4. THUDM/ChatGLM-6B: https://github.com/THUDM/ChatGLM-6B.
5. No.2 Middle School Affiliated to Shanghai Jiao Tong University: https://www.jd2fz.sjtu.edu.cn/.
6. High School Affiliated to Shanghai Jiao Tong University: https://fz.sjtu.edu.cn/.
7. GPT-4: https://openai.com/research/gpt-4. We use GPT-4 and ChatGPT for the exams at https://chat.openai.com/.
8. SparkDesk by iFLYTEK can be accessed at https://xinghuo.xfyun.cn/.
9. More detailed results can be found at https://github.com/hcffffff/PH-dataset-OMC/tree/main/res/simulated-history-exam.
Acknowledgements
This work is supported by No.2 Middle School Affiliated to Shanghai Jiao Tong University and High School Affiliated to Shanghai Jiao Tong University.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
He, C., Li, C., Han, T., Shen, L. (2024). Assessing and Enhancing LLMs: A Physics and History Dataset and One-More-Check Pipeline Method. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1967. Springer, Singapore. https://doi.org/10.1007/978-981-99-8178-6_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8177-9
Online ISBN: 978-981-99-8178-6
eBook Packages: Computer Science, Computer Science (R0)