Abstract
Large language models (LLMs) such as ChatGPT present immense opportunities, but without proper training for users (and potentially oversight), they also carry risks of misuse. We argue that current approaches focusing predominantly on transparency and explainability fall short in addressing the diverse needs and concerns of various user groups. We highlight the limitations of existing methodologies and propose a framework anchored in user-centric guidelines. In particular, we argue that LLM users should be given guidelines on which tasks LLMs can do well and which they cannot, which tasks require further guidance or refinement by the user, and context-specific heuristics. We further argue that (some) users should be taught to formulate and refine adequate prompts, be provided with good procedures for prompt iteration, and be taught efficient ways to verify outputs. We suggest that shifting the focus away from the technology itself and toward its use within contextualized sociotechnical systems can help resolve many of the issues related to LLMs. We further emphasize the role of real-world case studies in shaping these guidelines, ensuring they are grounded in practical, applicable strategies. As with any technology, risks of misuse can be managed through education, regulation, and responsible development.
Change history
10 March 2025
A Correction to this paper has been published: https://doi.org/10.1007/s10676-025-09824-7
Notes
For example, Harvard, the University of California, Berkeley, and the University of Missouri have spearheaded efforts to codify guidelines on responsible and ethical use of LLMs within the university context. See https://provost.harvard.edu/guidelines-using-chatgpt-and-other-generative-ai-tools-harvard, https://ethics.berkeley.edu/privacy/appropriate-use-chatgpt-and-similar-ai-tools, https://oai.missouri.edu/chatgpt-artificial-intelligence-and-academic-integrity/.
Importantly, our concern is with mitigation of unintentional or possibly negligent misuse stemming from user ignorance regarding the limitations of these systems. Willful and malicious misuse will still obviously present a problem, but mitigation strategies for this will need to be crafted along very different lines, in keeping with the different nature of such misuses. Exploration of this is beyond the scope of the current article.
See Wood (2024) for further exploration of challenges in using XAI to improve effective and responsible use of AI-enabled systems.
E.g., Bowman (2023) and Zhao et al. (2023). See also the discussion presented in layman’s terms at https://www.linkedin.com/pulse/when-llm-experts-say-we-dont-know-how-pallav-sharda-2tpyc/.
More broadly, emphasis on XAI, assuming it can be fully achieved, may undermine more institutional and human-centric approaches. See Wood (2024).
Some might argue that “rules of thumb” or heuristics for guiding LLM use are not amenable to empirical testing or verification. What we have in mind, however, is a general ability to empirically check whether guidelines improve use of LLMs (in terms of users accomplishing the tasks they are employing LLMs for), and in this respect, it should be possible to empirically examine whether guidelines are indeed improving use, detracting from it, or having a negligible impact. The precise impact of various guidelines, and their implementation, would further provide useful running data for the improvement of user interfaces with an eye to ever more effective and responsible LLM use. See also Barman et al. (2024a, 2024b).
For candidate approaches in this direction, see, e.g., Wang et al. (2024a, 2024b) and Watkins (2023), as well as https://www.dpc.sa.gov.au/__data/assets/pdf_file/0007/936745/Guideline-13.1-Use-of-Large-Language-Model-AI-Tools-Utilities.pdf and https://www.isc.upenn.edu/security/LLM-guide. See Johri et al. (2023) for more meta-level guidelines embedded within a specific context, i.e., LLM use in the field of medicine.
References
Abid, A., Farooqi, M., & Zou, J. (2021). Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6), 461–463.
Agarwal, V., Thureja, N., Garg, M. K., Dharmavaram, S., & Kumar, D. (2024). “Which LLM should I use?”: Evaluating LLMs for tasks performed by undergraduate computer science students in India. Preprint retrieved from arXiv:2402.01687.
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., ... & Herrera, F. (2020). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.
Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G. L., Corney, D., ... & Zagni, G. (2023). Factuality challenges in the era of large language models. Preprint retrieved from arXiv:2310.05189.
Barman, D., Guo, Z., & Conlan, O. (2024a). The dark side of language models: Exploring the potential of LLMs in multimedia disinformation generation and dissemination. Machine Learning with Applications, 16, 100545.
Barman, K. G., Caron, S., Claassen, T., & De Regt, H. (2024b). Towards a benchmark for scientific understanding in humans and machines. Minds and Machines, 34(1), 1–16.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘21), 610–623. https://doi.org/10.1145/3442188.3445922
Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., & Saunders, W. (2023). Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
Boge, F. J. (2022). Two dimensions of opacity and the deep learning predicament. Minds and Machines, 32(1), 43–75.
Boiko, D. A., MacKnight, R., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. Preprint retrieved from https://arxiv.org/abs/2304.05332
Burrell, J. (2016). How the machine ‘thinks’: Understanding opacity in machine learning algorithms. Big Data & Society, 3(1), 2053951715622512.
Buruk, O. (2023). Academic writing with GPT-3.5: Reflections on practices, efficacy and transparency. Preprint retrieved from arXiv:2304.11079.
Chen, C., & Shu, K. (2023). Combating misinformation in the age of LLMs: Opportunities and challenges. Preprint retrieved from arXiv:2311.05656.
Choi, E. (2023). A comprehensive inquiry into the use of ChatGPT: Examining general, educational, and disability-focused perspectives. International Journal of Arts Humanities and Social Sciences. https://doi.org/10.56734/ijahss.v4n11a1
Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023). Towards automated circuit discovery for mechanistic interpretability. Preprint retrieved from arXiv:2304.14997.
Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023b). Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36, 16318–16352.
de Fine Licht, K. (2023). Integrating large language models into higher education: Guidelines for effective implementation. Computer Sciences & Mathematics Forum, 8(1), 65.
Dergaa, I., Chamari, K., Zmijewski, P., & Ben Saad, H. (2023). From human writing to artificial intelligence generated text: Examining the prospects and potential threats of ChatGPT in academic writing. Biology of Sport, 40(2), 615–622. https://doi.org/10.5114/biolsport.2023.125623
Durán, J. M. (2021). Dissecting scientific explanation in AI (sXAI): A case for medicine and healthcare. Artificial Intelligence, 297, 103498.
Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2023). GPTs are GPTs: An early look at the labor market impact potential of large language models. Preprint retrieved from arXiv:2303.10130.
Essel, H. B., Vlachopoulos, D., Essuman, A. B., & Amankwa, J. O. (2024). ChatGPT effects on cognitive skills of undergraduate students: Receiving instant responses from AI-based conversational large language models (LLMs). Computers and Education: Artificial Intelligence, 6, 100198.
Extance, A. (2023). ChatGPT has entered the classroom: How LLMs could transform education. Nature, 623(7987), 474–477.
Fan, L., Li, L., Ma, Z., Lee, S., Yu, H., & Hemphill, L. (2023). A bibliometric review of large language models research from 2017 to 2023. Preprint retrieved from https://doi.org/10.48550/arXiv.2304.02020
Fear, K., & Gleber, C. (2023). Shaping the future of older adult care: ChatGPT, advanced AI, and the transformation of clinical practice. JMIR Aging, 6(1), e51776.
Ferrara, E. (2023). Should ChatGPT be biased? Challenges and risks of bias in large language models. Preprint retrieved from arXiv:2304.03738.
Gallegos, I. O., Rossi, R. A., Barrow, J., Tanjim, M. M., Kim, S., Dernoncourt, F., ... & Ahmed, N. K. (2023). Bias and fairness in large language models: A survey. Preprint retrieved from arXiv:2309.00770.
Girotra, K., Meincke, L., Terwiesch, C., & Ulrich, K. T. (2023). Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071.
Guo, Y., & Lee, D. (2023). Leveraging ChatGPT for enhancing critical thinking skills. Journal of Chemical Education, 100(12), 4876–4883.
Hadi, M. U., Al-Tashi, Q., Qureshi, R., Shah, A., Muneer, A., Irfan, M., Zafar, A., Shaikh, M. B., Akhtar, N., Wu, J., Mirjalili, S., & Shah, M. (2023). Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Preprint retrieved from https://doi.org/10.36227/techrxiv.23589741.v4
Hadi, M. U., Qureshi, R., Shah, A., Irfan, M., Zafar, A., Shaikh, M. B., ... & Mirjalili, S. (2023). Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Preprints.
Humphreys, P. (2009). The philosophical novelty of computer simulation methods. Synthese, 169, 615–626.
Inagaki, T., Kato, A., Takahashi, K., Ozaki, H., & Kanda, G. N. (2023). LLMs can generate robotic scripts from goal-oriented instructions in biological laboratory automation. Preprint retrieved from https://doi.org/10.48550/arXiv.2304.10267
Jablonka, K. M., Ai, Q., Al-Feghali, A., Badhwar, S., Bocarsly, J. D., Bran, A. M., Bringuier, S., Brinson, L. C., Choudhary, K., Circi, D., Cox, S., de Jong, W. A., Evans, M. L., Gastellu, N., Genzling, J., Gil, M. V., Gupta, A. K., Hong, Z., Imran, A., ... Blaiszik, B. (2023). 14 examples of how LLMs can transform materials science and chemistry: A reflection on a large language model hackathon. Digital Discovery, 2(5), 1233–1250. https://doi.org/10.1039/d3dd00113j
Johri, S., Jeong, J., Tran, B. A., Schlessinger, D. I., Wongvibulsin, S., Cai, Z. R., ... & Rajpurkar, P. (2023). Guidelines for rigorous evaluation of clinical LLMs for conversational reasoning. medRxiv, 2023–09.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., ... & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.
Kim, J. K., Chua, M., Rickard, M., & Lorenzo, A. (2023). ChatGPT and large language model (LLM) chatbots: The current state of acceptability and a proposal for guidelines on utilization in academic medicine. Journal of Pediatric Urology, 19, 598.
Lee, J., Le, T., Chen, J., & Lee, D. (2023). Do language models plagiarize? In Proceedings of the ACM Web Conference 2023 (pp. 3637–3647). ACM. https://doi.org/10.1145/3543507.3583199
Li, Y., Du, M., Song, R., Wang, X., & Wang, Y. (2023). A survey on fairness in large language models. Preprint retrieved from arXiv:2308.10149.
Liao, Q. V., & Vaughan, J. W. (2023). AI transparency in the age of LLMs: A human-centered research roadmap. Preprint retrieved from arXiv:2306.01941.
Lin, Z. (2023). Why and how to embrace AI such as ChatGPT in your academic life. Royal Society Open Science, 10(8), 230658. https://doi.org/10.1098/rsos.230658
Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35, 17359–17372.
Mishra, A., Soni, U., Arunkumar, A., Huang, J., Kwon, B. C., & Bryan, C. (2023). PromptAid: Prompt exploration, perturbation, testing and iteration using visual analytics for large language models. Preprint retrieved from arXiv:2304.01964.
Mittelstadt, B., Wachter, S., & Russell, C. (2023). To protect science, we must use LLMs as zero-shot translators. Nature Human Behaviour, 7(11), 1830–1832.
Noy, S., & Zhang, W. (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Available at SSRN 4375283.
OpenAI. (2023). GPT-4 technical report. Preprint retrieved from arXiv:2303.08774.
Pan, Y., Pan, L., Chen, W., Nakov, P., Kan, M.-Y., & Wang, W. Y. (2023). On the risk of misinformation pollution with large language models. Preprint retrieved from https://doi.org/10.48550/arXiv.2305.13661
Qadir, J. (2023). Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education. In 2023 IEEE Global Engineering Education Conference (EDUCON). IEEE.
Rakap, S. (2023). Chatting with GPT: Enhancing individualized education program goal development for novice special education teachers. Journal of Special Education Technology. https://doi.org/10.1177/01626434231211295
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Model-agnostic interpretability of machine learning. Preprint retrieved from arXiv:1606.05386.
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
Salinas, A., Shah, P., Huang, Y., McCormack, R., & Morstatter, F. (2023). The unequal opportunities of large language models: Examining demographic biases in job recommendations by ChatGPT and LLaMA. In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (pp. 1–15). ACM.
Schramowski, P., Turan, C., Andersen, N., & Herbert, F. (2022). Large pre-trained language models contain human-like biases of what is right and wrong to do. Nature Machine Intelligence, 4(3), 258–268. https://doi.org/10.1038/s42256-022-00458-8
De Silva, D., Mills, N., El-Ayoubi, M., Manic, M., & Alahakoon, D. (2023). ChatGPT and generative AI guidelines for addressing academic integrity and augmenting pre-existing chatbots. In 2023 IEEE International Conference on Industrial Technology (ICIT) (pp. 1–6). IEEE. https://doi.org/10.1109/ICIT58465.2023.10143123
Sun, Z. (2023). A short survey of viewing large language models in legal aspect. Preprint retrieved from arXiv:2303.09136.
Valentino, M., & Freitas, A. (2022). Scientific explanation and natural language: A unified epistemological-linguistic perspective for explainable AI. Preprint retrieved from arXiv:2205.01809.
Vidgof, M., Bachhofner, S., & Mendling, J. (2023). Large language models for business process management: Opportunities and challenges. Preprint retrieved from https://doi.org/10.48550/arXiv.2304.04309
Wang, J., Ma, W., Sun, P., Zhang, M., & Nie, J. Y. (2024a). Understanding user experience in large language model interactions. Preprint retrieved from arXiv:2401.08329.
Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W., & Li, J. (2024b). Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. npj Digital Medicine, 7(1), 41.
Watkins, R. (2023). Guidance for researchers and peer-reviewers on the ethical use of Large Language Models (LLMs) in scientific research workflows. AI and Ethics. https://doi.org/10.1007/s43681-023-00294-5
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. https://doi.org/10.48550/arXiv.2201.11903
Williams, N., Ivanov, S., & Buhalis, D. (2023). Algorithmic ghost in the research shell: Large language models and academic knowledge creation in management research. Preprint retrieved from https://doi.org/10.48550/arXiv.2303.07304
Wood, N. G. (2024). Explainable AI in the military domain. Ethics and Information Technology, 26(2), 1–13.
Xiao, Z., Yuan, X., Liao, Q. V., Abdelghani, R., & Oudeyer, P.-Y. (2023). Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces (pp. 75–78). ACM. https://doi.org/10.1145/3581754.3584101
Yadav, G. (2023). Scaling evidence-based instructional design expertise through large language models. Preprint retrieved from https://doi.org/10.48550/arXiv.2306.01006
Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2023). Practical and ethical challenges of large language models in education: A systematic literature review. Preprint retrieved from https://doi.org/10.48550/arXiv.2303.13379
Yell, M. M. (2023). Social studies, ChatGPT, and lateral reading. Social Education, 87(3), 138–141.
Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., ... & Du, M. (2023). Explainability for large language models: A survey. Preprint retrieved from arXiv:2309.01029.
Zolanvari, M., Yang, Z., Khan, K., Jain, R., & Meskin, N. (2021). TRUST XAI: Model-agnostic explanations for AI with a case study on IIoT security. IEEE Internet of Things Journal.
Funding
This work was funded by Fonds Wetenschappelijk Onderzoek (Grant numbers: 1229124N for Kristian González Barman and 1255724N for Pawel Pawlowski) and the Czech Science Foundation (Grant number 24-12638I for Nathan Wood).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: the affiliation and email address of one of the authors were corrected.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Barman, K.G., Wood, N. & Pawlowski, P. Beyond transparency and explainability: on the need for adequate and contextualized user guidelines for LLM use. Ethics Inf Technol 26, 47 (2024). https://doi.org/10.1007/s10676-024-09778-2