Abstract
Large language models (LLMs) demonstrate significant capabilities on traditional natural language processing (NLP) tasks and many examinations. However, few evaluations target specific subjects in the Chinese educational context. This study, focusing on secondary-school physics and history, explores the potential and limitations of LLMs in Chinese education. Our contributions are as follows: we establish the PH dataset, which covers secondary-school physics and history in Chinese and comprises thousands of multiple-choice questions; we evaluate three prevalent LLMs (ChatGPT, GPT-3, and ChatGLM) on the PH dataset; we propose a new prompting method, One-More-Check (OMC), to enhance the logical-reasoning capacity of LLMs; finally, we have the three LLMs sit an actual secondary-school history exam. Our findings suggest that OMC improves LLM performance on logical reasoning, and that the LLMs underperform the average age-appropriate student on the history exam. All datasets, code, and evaluation results are available at https://github.com/hcffffff/PH-dataset-OMC.
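The paper itself details the OMC pipeline; as a rough illustration of the idea the name suggests, a self-checking prompt loop asks the model for an answer and then prompts it once more to verify that answer before committing. The sketch below is a hypothetical reading, not the authors' implementation: `ask_llm`, the prompt wording, and the single-retry policy are all illustrative assumptions.

```python
# Hedged sketch of a "check your answer one more time" prompting loop,
# loosely inspired by the One-More-Check (OMC) idea. ask_llm stands in
# for any chat-completion call (e.g. via the OpenAI API); its signature
# here is an assumption for illustration.

def one_more_check(ask_llm, question: str) -> str:
    """Ask a multiple-choice question, then prompt the model once more
    to re-examine its own answer before returning a final choice."""
    first = ask_llm(
        f"Question:\n{question}\nAnswer with a single option letter."
    )
    check_prompt = (
        f"Question:\n{question}\n"
        f"Your previous answer was: {first}\n"
        "Check the reasoning one more time. If the answer is wrong, "
        "reply with the corrected option letter; otherwise repeat it."
    )
    # The second pass's output is taken as the final answer.
    return ask_llm(check_prompt).strip()
```

With a real model behind `ask_llm`, the second pass gives the model a chance to catch a slip in its first-pass reasoning, which is the behavior the evaluation in this paper measures.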
Notes
1. C-EVAL official code: https://github.com/SJTU-LIT/ceval.
2. GAOKAO-Bench existing results: https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/result.
3. Use the OpenAI API at https://openai.com/api/.
4. THUDM/ChatGLM-6B: https://github.com/THUDM/ChatGLM-6B.
5. No.2 Middle School Affiliated to Shanghai Jiao Tong University: https://www.jd2fz.sjtu.edu.cn/.
6. High School Affiliated to Shanghai Jiao Tong University: https://fz.sjtu.edu.cn/.
7. GPT-4: https://openai.com/research/gpt-4. We use GPT-4 and ChatGPT for the exams at https://chat.openai.com/.
8. SparkDesk by iFLYTEK can be accessed at https://xinghuo.xfyun.cn/.
9. More detailed results can be found at https://github.com/hcffffff/PH-dataset-OMC/tree/main/res/simulated-history-exam.
Acknowledgements
This work is supported by No.2 Middle School Affiliated to Shanghai Jiao Tong University and High School Affiliated to Shanghai Jiao Tong University.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
He, C., Li, C., Han, T., Shen, L. (2024). Assessing and Enhancing LLMs: A Physics and History Dataset and One-More-Check Pipeline Method. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1967. Springer, Singapore. https://doi.org/10.1007/978-981-99-8178-6_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8177-9
Online ISBN: 978-981-99-8178-6
eBook Packages: Computer Science, Computer Science (R0)