Abstract
ChatGPT is a powerful large language model (LLM) trained on knowledge resources such as Wikipedia, and it can answer natural language questions using its own knowledge. There is therefore growing interest in whether ChatGPT can replace traditional knowledge-based question answering (KBQA) models. Although some works have analyzed the question answering performance of ChatGPT, large-scale, comprehensive testing across diverse types of complex questions is still lacking, which makes it difficult to characterize the model's limitations. In this paper, we present a framework that follows the black-box testing specifications of CheckList [38]. We evaluate ChatGPT and its family of LLMs on eight real-world KB-based complex question answering datasets, comprising six English datasets and two multilingual datasets, with approximately 190,000 test cases in total. In addition to the GPT family of LLMs, we also evaluate the well-known FLAN-T5 to identify commonalities between the GPT family and other LLMs. The dataset and code are available at https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-GPT-family.git.
Y. Tan and D. Min contributed equally to this work.
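To make the black-box evaluation setup concrete, the following is a minimal, illustrative sketch (not the authors' released code) of how one might score an LLM on a KBQA test set: each test case pairs a natural language question with a set of gold answers, the model's free-text reply is normalized, and answer-level exact match and token F1 are averaged over all cases. The `query_model` callable and the JSON-lines test-case format are assumptions for illustration only.

```python
import json
import re
import string
from typing import Callable, Iterable, List, Set


def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def answer_set(texts: Iterable[str]) -> Set[str]:
    """Normalized set of gold answer strings."""
    return {normalize(t) for t in texts if normalize(t)}


def score_case(prediction: str, gold_answers: List[str]) -> dict:
    """Exact match and token-overlap F1 between one reply and the gold answers."""
    pred_tokens = set(normalize(prediction).split())
    gold = answer_set(gold_answers)
    exact = float(normalize(prediction) in gold)
    gold_tokens = set(" ".join(gold).split())
    overlap = pred_tokens & gold_tokens
    precision = len(overlap) / len(pred_tokens) if pred_tokens else 0.0
    recall = len(overlap) / len(gold_tokens) if gold_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"exact_match": exact, "f1": f1}


def evaluate(test_file: str, query_model: Callable[[str], str]) -> dict:
    """Black-box loop: send each question to the model, score its free-text reply.

    Assumes one JSON object per line: {"question": "...", "answers": ["...", ...]}.
    """
    with open(test_file, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]
    totals = {"exact_match": 0.0, "f1": 0.0}
    for case in cases:
        reply = query_model(case["question"])
        scores = score_case(reply, case["answers"])
        for key in totals:
            totals[key] += scores[key]
    return {key: value / len(cases) for key, value in totals.items()}
```

In practice, `query_model` would wrap whichever LLM is under test, and CheckList-style testing would additionally group the cases by question type (e.g., multi-hop, comparison, multilingual) so that failure rates can be reported per capability rather than only in aggregate.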
References
Bai, Y., et al.: Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181 (2023)
Bang, Y., et al.: A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv e-prints, arXiv-2302 (2023)
Belinkov, Y., Glass, J.: Analysis methods in neural language processing: a survey. Trans. Assoc. Comput. Linguist. 7, 49–72 (2019)
Brown, T., et al.: Language models are few-shot learners. Adv. Neural. Inf. Process. Syst. 33, 1877–1901 (2020)
Cao, S., et al.: KQA Pro: a dataset with explicit compositional programs for complex question answering over knowledge base. In: Proceedings ACL Conference, pp. 6101–6119 (2022)
Chang, Y., et al.: A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 (2023)
Chen, X., et al.: How robust is GPT-3.5 to predecessors? A comprehensive study on language understanding tasks. arXiv e-prints, arXiv-2303 (2023)
Chowdhery, A., et al.: PaLM: scaling language modeling with pathways. arXiv e-prints, arXiv-2204 (2022)
Chung, H.W., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
Fu, C., et al.: MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
Fu, Y., Peng, H., Khot, T.: How does GPT obtain its ability? Tracing emergent abilities of language models to their sources. Yao Fu’s Notion (2022)
Gu, Y., et al.: Beyond IID: three levels of generalization for question answering on knowledge bases. In: Proceedings WWW Conference, pp. 3477–3488 (2021)
Gu, Y., Su, Y.: ArcaneQA: dynamic program induction and contextualized encoding for knowledge base question answering. In: Proceedings COLING Conference, pp. 1718–1731 (2022)
He, H., Choi, J.D.: The stem cell hypothesis: dilemma behind multi-task learning with transformer encoders. In: Proceedings EMNLP Conference, pp. 5555–5577 (2021)
Hu, X., Wu, X., Shu, Y., Qu, Y.: Logical form generation via multi-task learning for complex question answering over knowledge bases. In: Proceedings COLING Conference, pp. 1687–1696 (2022)
Huang, F., Kwak, H., An, J.: Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. arXiv e-prints, arXiv-2302 (2023)
Jiang, Z., Xu, F.F., Araki, J., Neubig, G.: How can we know what language models know? Trans. Assoc. Comput. Linguist. 8, 423–438 (2020)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings NAACL-HLT Conference, pp. 4171–4186 (2019)
Kocoń, J., et al.: ChatGPT: jack of all trades, master of none. arXiv e-prints, arXiv-2302 (2023)
Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916 (2022)
Liang, P., et al.: Holistic evaluation of language models. arXiv e-prints, arXiv-2211 (2022)
Longpre, S., Lu, Y., Daiber, J.: MKQA: a linguistically diverse benchmark for multilingual open domain question answering. Trans. Assoc. Comput. Linguist. 9, 1389–1406 (2021)
Lyu, C., Xu, J., Wang, L.: New trends in machine translation using large language models: case examples with ChatGPT. arXiv preprint arXiv:2305.01181 (2023)
Usbeck, R., Gusmita, R.H., Ngonga Ngomo, A.C., Saleem, M.: 9th challenge on question answering over linked data (QALD-9). In: Joint Proceedings of SemDeep-4 and NLIWoD-4 at ISWC, pp. 58–64 (2018)
Nie, L., et al.: GraphQ IR: unifying the semantic parsing of graph query languages with one intermediate representation. In: Proceedings EMNLP Conference, pp. 5848–5865 (2022)
Omar, R., Mangukiya, O., Kalnis, P., Mansour, E.: ChatGPT versus traditional question answering for knowledge graphs: current status and future directions towards knowledge graph chatbots. arXiv e-prints, arXiv-2302 (2023)
OpenAI: GPT-4 technical report (2023)
Ouyang, L., et al.: Training language models to follow instructions with human feedback. arXiv e-prints, arXiv-2203 (2022)
Perevalov, A., Yan, X., Kovriguina, L., Jiang, L., Both, A., Usbeck, R.: Knowledge graph question answering leaderboard: a community resource to prevent a replication crisis. In: Proceedings LREC Conference, pp. 2998–3007 (2022)
Petroni, F., et al.: Language models as knowledge bases? In: Proceedings EMNLP-IJCNLP Conference, pp. 2463–2473 (2019)
Pramanik, S., Alabi, J., Saha Roy, R., Weikum, G.: UNIQORN: unified question answering over RDF knowledge graphs and natural language text. arXiv e-prints, arXiv-2108 (2021)
Purkayastha, S., Dana, S., Garg, D., Khandelwal, D., Bhargav, G.S.: A deep neural approach to KGQA via SPARQL silhouette generation. In: Proceedings IJCNN Conference, pp. 1–8. IEEE (2022)
Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., Yang, D.: Is ChatGPT a general-purpose natural language processing task solver? arXiv e-prints, arXiv-2302 (2023)
Qin, G., Eisner, J.: Learning how to ask: querying LMs with mixtures of soft prompts. In: Proceedings NAACL-HLT Conference (2021)
Rae, J.W., et al.: Scaling language models: methods, analysis & insights from training gopher. arXiv e-prints, arXiv-2112 (2021)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(1), 5485–5551 (2020)
Reynolds, L., McDonell, K.: Prompt programming for large language models: beyond the few-shot paradigm. In: Proceedings CHI EA Conference, pp. 1–7 (2021)
Ribeiro, M.T., Wu, T., Guestrin, C., Singh, S.: Beyond accuracy: behavioral testing of NLP models with checklist. In: Proceedings ACL Conference, pp. 4902–4912 (2020)
Rychalska, B., Basaj, D., Gosiewska, A., Biecek, P.: Models in the wild: on corruption robustness of neural NLP systems. In: Proceedings ICONIP Conference, pp. 235–247 (2019)
Segura, S., Fraser, G., Sanchez, A.B., Ruiz-Cortés, A.: A survey on metamorphic testing. IEEE Trans. Software Eng. 42(9), 805–824 (2016)
Srivastava, A., et al.: Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022)
Su, Y., et al.: On generating characteristic-rich question sets for QA evaluation. In: Proceedings EMNLP Conference, pp. 562–572 (2016)
Talmor, A., Berant, J.: The web as a knowledge-base for answering complex questions. In: Proceedings ACL Conference, pp. 641–651 (2018)
Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings NAACL-HLT Conference, pp. 142–147 (2003)
Wang, A., et al.: SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In: Proceedings NeurIPS Conference, pp. 3266–3280 (2019)
Wang, J., Liang, Y., Meng, F., Li, Z., Qu, J., Zhou, J.: Cross-lingual summarization via ChatGPT. arXiv e-prints, arXiv-2302 (2023)
Wang, S., Scells, H., Koopman, B., Zuccon, G.: Can ChatGPT write a good Boolean query for systematic review literature search? arXiv e-prints, arXiv-2302 (2023)
Wei, J., et al.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022)
Wu, T., Ribeiro, M.T., Heer, J., Weld, D.S.: Errudite: scalable, reproducible, and testable error analysis. In: Proceedings ACL Conference, pp. 747–763 (2019)
Ye, X., Yavuz, S., Hashimoto, K., Zhou, Y., Xiong, C.: RNG-KBQA: generation augmented iterative ranking for knowledge base question answering. In: Proceedings ACL Conference, pp. 6032–6043 (2022)
Yih, W.T., Richardson, M., Meek, C., Chang, M.W., Suh, J.: The value of semantic parse labeling for knowledge base question answering. In: Proceedings ACL Conference, pp. 201–206 (2016)
Zhong, Q., Ding, L., Liu, J., Du, B., Tao, D.: Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv e-prints, arXiv-2302 (2023)
Zhu, K., et al.: PromptBench: towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528 (2023)
Zhuo, T.Y., Huang, Y., Chen, C., Xing, Z.: Exploring AI ethics of ChatGPT: a diagnostic analysis. arXiv e-prints, arXiv-2301 (2023)
Acknowledgments
This work is supported by the Natural Science Foundation of China (Grant No. U21A20488). We thank the Big Data Computing Center of Southeast University for providing facility support for the numerical calculations in this paper.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tan, Y. et al. (2023). Can ChatGPT Replace Traditional KBQA Models? An In-Depth Analysis of the Question Answering Performance of the GPT LLM Family. In: Payne, T.R., et al. The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol 14265. Springer, Cham. https://doi.org/10.1007/978-3-031-47240-4_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47239-8
Online ISBN: 978-3-031-47240-4