
Assessing and Enhancing LLMs: A Physics and History Dataset and One-More-Check Pipeline Method

  • Conference paper
  • Published in: Neural Information Processing (ICONIP 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1967)


Abstract

Large language models (LLMs) demonstrate significant capabilities on traditional natural language processing (NLP) tasks and on many examinations. However, few evaluations target specific subjects in the Chinese educational context. This study, focusing on secondary-school physics and history, explores the potential and limitations of LLMs in Chinese education. Our contributions are as follows: we establish the PH dataset, which covers secondary-school physics and history in Chinese and comprises thousands of multiple-choice questions; we evaluate three prevalent LLMs (ChatGPT, GPT-3, and ChatGLM) on the PH dataset; we propose a new prompting method, One-More-Check (OMC), to enhance the logical reasoning capacity of LLMs; finally, we have three LLMs sit an actual secondary-school history exam. Our findings suggest that the OMC method improves the logical reasoning performance of LLMs, and that LLMs underperform the average age-appropriate student on the history exam. All datasets, code, and evaluation results are available at https://github.com/hcffffff/PH-dataset-OMC.
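
For readers who want to prototype something similar, the sketch below shows one plausible reading of an OMC-style pipeline: the model first answers a multiple-choice question with its reasoning, then is prompted one more time to check and, if necessary, revise that answer. The prompt wording, the ask_omc helper, and the single extra check pass are illustrative assumptions rather than the authors' exact method (see the linked repository for that); the OpenAI client calls themselves are standard.

```python
# A minimal, hypothetical sketch of a One-More-Check (OMC)-style prompt
# pipeline for multiple-choice questions. The prompt texts and the single
# verification pass are assumptions; the authors' actual implementation
# lives at https://github.com/hcffffff/PH-dataset-OMC.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-3.5-turbo"

def chat(messages):
    """Send a chat history to the model and return the reply text."""
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

def ask_omc(question: str, choices: str) -> str:
    """Answer a multiple-choice question, then check it one more time."""
    history = [
        {"role": "user",
         "content": f"{question}\n{choices}\n"
                    "Reason step by step, then state the answer letter."},
    ]
    first = chat(history)

    # One-More-Check: feed the model its own answer back and ask it to
    # verify the reasoning before committing to a final choice.
    history += [
        {"role": "assistant", "content": first},
        {"role": "user",
         "content": "Check your reasoning and answer one more time. "
                    "If you find an error, correct it; otherwise confirm. "
                    "End with only the final answer letter."},
    ]
    return chat(history)

if __name__ == "__main__":
    print(ask_omc("Which quantity is conserved in an elastic collision?",
                  "A. Momentum only\nB. Kinetic energy only\n"
                  "C. Both momentum and kinetic energy\nD. Neither"))
```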


Notes

  1. C-EVAL official code: https://github.com/SJTU-LIT/ceval.

  2. GAOKAO-Bench existing results: https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/result.

  3. The OpenAI API is available at https://openai.com/api/.

  4. THUDM/ChatGLM-6B: https://github.com/THUDM/ChatGLM-6B.

  5. No.2 Middle School Affiliated to Shanghai Jiao Tong University: https://www.jd2fz.sjtu.edu.cn/.

  6. High School Affiliated to Shanghai Jiao Tong University: https://fz.sjtu.edu.cn/.

  7. GPT-4: https://openai.com/research/gpt-4. We use GPT-4 and ChatGPT for the exams at https://chat.openai.com/.

  8. SparkDesk by iFLYTEK can be accessed at https://xinghuo.xfyun.cn/.

  9. More detailed results can be found at https://github.com/hcffffff/PH-dataset-OMC/tree/main/res/simulated-history-exam.



Acknowledgements

This work is supported by No.2 Middle School Affiliated to Shanghai Jiao Tong University and High School Affiliated to Shanghai Jiao Tong University.

Author information


Corresponding author

Correspondence to Liping Shen.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

He, C., Li, C., Han, T., Shen, L. (2024). Assessing and Enhancing LLMs: A Physics and History Dataset and One-More-Check Pipeline Method. In: Luo, B., Cheng, L., Wu, Z.G., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1967. Springer, Singapore. https://doi.org/10.1007/978-981-99-8178-6_38

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8178-6_38

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8177-9

  • Online ISBN: 978-981-99-8178-6

  • eBook Packages: Computer Science, Computer Science (R0)
