Beyond transparency and explainability: on the need for adequate and contextualized user guidelines for LLM use

  • Original Paper
  • Published:
Ethics and Information Technology

A Correction to this article was published on 10 March 2025

This article has been updated

Abstract

Large language models (LLMs) such as ChatGPT present immense opportunities, but without proper training for users (and potentially oversight), they also carry risks of misuse. We argue that current approaches focusing predominantly on transparency and explainability fall short in addressing the diverse needs and concerns of various user groups. We highlight the limitations of existing methodologies and propose a framework anchored in user-centric guidelines. In particular, we argue that LLM users should be given guidelines on which tasks LLMs can do well and which they cannot, which tasks require further guidance or refinement by the user, and context-specific heuristics. We further argue that (some) users should be taught to refine and elaborate adequate prompts, be provided with good procedures for prompt iteration, and be taught efficient ways to verify outputs. We suggest that shifting attention away from the technology itself and toward its use within contextualized sociotechnical systems can help resolve many issues related to LLMs. We further emphasize the role of real-world case studies in shaping these guidelines, ensuring they are grounded in practical, applicable strategies. As with any technology, the risks of misuse can be managed through education, regulation, and responsible development.

Notes

  1. For example, Harvard, the University of California, Berkeley, and the University of Missouri have spearheaded efforts to codify guidelines on responsible and ethical use of LLMs within the university context. See https://provost.harvard.edu/guidelines-using-chatgpt-and-other-generative-ai-tools-harvard, https://ethics.berkeley.edu/privacy/appropriate-use-chatgpt-and-similar-ai-tools, https://oai.missouri.edu/chatgpt-artificial-intelligence-and-academic-integrity/.

  2. Importantly, our concern is with mitigation of unintentional or possibly negligent misuse stemming from user ignorance regarding the limitations of these systems. Willful and malicious misuse will still obviously present a problem, but mitigation strategies for this will need to be crafted along very different lines, in keeping with the different nature of such misuses. Exploration of this is beyond the scope of the current article.

  3. See, e.g., (Augenstein et al., 2023; Barman et al., 2024a, 2024b; Chen & Shu, 2023; Mittelstadt et al., 2023).

  4. For examples of such problems arising in real-world contexts, see, e.g., Gallegos et al. (2023), Li et al. (2023) and Salinas et al. (2023).

  5. See Wood (2024) for further exploration of challenges in using XAI to improve effective and responsible use of AI-enabled systems.

  6. See, e.g., Liao and Vaughan (2023) and Wang et al. (2024a, 2024b).

  7. E.g., Bowman (2023) and Zhao et al. (2023). See also the discussion presented in layman’s terms at https://www.linkedin.com/pulse/when-llm-experts-say-we-dont-know-how-pallav-sharda-2tpyc/.

  8. More broadly, emphasis on XAI, assuming it can be fully achieved, may undermine more institutional and human-centric approaches. See Wood (2024).

  9. Some might argue that “rules of thumb” or heuristics for guiding LLM use are not amenable to empirical testing or verification. What we have in mind, however, is a general ability to empirically check whether guidelines improve use of LLMs (in terms of users accomplishing the tasks they are employing LLMs for), and in this respect it should be possible to empirically examine whether guidelines are indeed improving use, detracting from it, or having a negligible impact. The precise impact of various guidelines, and their implementation, would further provide useful running data for the improvement of user interfaces with an eye to ever more effective and responsible LLM use. See also Barman et al. (2024a, 2024b).

  10. For candidate approaches in this direction, see, e.g., Wang et al. (2024a, 2024b) and Watkins (2023), as well as https://www.dpc.sa.gov.au/__data/assets/pdf_file/0007/936745/Guideline-13.1-Use-of-Large-Language-Model-AI-Tools-Utilities.pdf and https://www.isc.upenn.edu/security/LLM-guide. See Johri et al. (2023) for more meta-level guidelines embedded within a specific context, i.e., LLM use in the field of medicine.

References

  • Abid, A., Farooqi, M., & Zou, J. (2021). Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6), 461–463.

  • Agarwal, V., Thureja, N., Garg, M. K., Dharmavaram, S., & Kumar, D. (2024). “Which LLM should I use?”: Evaluating LLMs for tasks performed by Undergraduate Computer Science Students in India. Preprint retrieved from arXiv:2402.01687.

  • Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., ... & Herrera, F. (2020). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.

  • Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G. L., Corney, D., ... & Zagni, G. (2023). Factuality challenges in the era of large language models. Preprint retrieved from arXiv:2310.05189.

  • Barman, D., Guo, Z., & Conlan, O. (2024a). The dark side of language models: Exploring the potential of LLMs in multimedia disinformation generation and dissemination. Machine Learning with Applications, 16, 100545.

  • Barman, K. G., Caron, S., Claassen, T., & De Regt, H. (2024b). Towards a benchmark for scientific understanding in humans and machines. Minds and Machines, 34(1), 1–16.

  • Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ‘21), 610–623. https://doi.org/10.1145/3442188.3445922

  • Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., & Saunders, W. (2023) Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.Html

  • Boge, F. J. (2022). Two dimensions of opacity and the deep learning predicament. Minds and Machines, 32(1), 43–75.

  • Boiko, D. A., MacKnight, R., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. Preprint retrieved from https://arxiv.org/abs/2304.05332

  • Burrell, J. (2016). How the machine ‘thinks’: Understanding opacity in machine learning algorithms. Big Data & Society, 3(1), 2053951715622512.

  • Buruk, O. (2023). Academic writing with GPT-3.5: Reflections on practices, efficacy and transparency. Preprint retrieved from arXiv:2304.11079.

  • Chen, C., & Shu, K. (2023). Combating misinformation in the age of LLMs: Opportunities and challenges. Preprint retrieved from arXiv:2311.05656.

  • Choi, E. (2023). A comprehensive inquiry into the use of ChatGPT: Examining general, educational, and disability-focused perspectives. International Journal of Arts Humanities and Social Sciences. https://doi.org/10.56734/ijahss.v4n11a1

  • Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023a). Towards automated circuit discovery for mechanistic interpretability. Preprint retrieved from arXiv:2304.14997.

  • Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023b). Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36, 16318–16352.

  • de Fine Licht, K. (2023). Integrating large language models into higher education: Guidelines for effective implementation. Computer Sciences & Mathematics Forum, 8(1), 65.

  • Dergaa, I., Chamari, K., Zmijewski, P., & Ben Saad, H. (2023). From human writing to artificial intelligence generated text: Examining the prospects and potential threats of ChatGPT in academic writing. Biology of Sport, 40(2), 615–622. https://doi.org/10.5114/biolsport.2023.125623

  • Durán, J. M. (2021). Dissecting scientific explanation in AI (sXAI): A case for medicine and healthcare. Artificial Intelligence, 297, 103498.

  • Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2023). GPTs are GPTs: An early look at the labor market impact potential of large language models. Preprint retrieved from arXiv:2303.10130.

  • Essel, H. B., Vlachopoulos, D., Essuman, A. B., & Amankwa, J. O. (2024). ChatGPT effects on cognitive skills of undergraduate students: Receiving instant responses from AI-based conversational large language models (LLMs). Computers and Education: Artificial Intelligence, 6, 100198.

  • Extance, A. (2023). ChatGPT has entered the classroom: How LLMs could transform education. Nature, 623(7987), 474–477.

  • Fan, L., Li, L., Ma, Z., Lee, S., Yu, H., & Hemphill, L. (2023). A bibliometric review of large language models research from 2017 to 2023. Preprint retrieved from https://doi.org/10.48550/arXiv.2304.02020

  • Fear, K., & Gleber, C. (2023). Shaping the future of older adult care: ChatGPT, advanced AI, and the transformation of clinical practice. JMIR Aging, 6(1), e51776.

  • Ferrara, E. (2023). Should ChatGPT be biased? Challenges and risks of bias in large language models. Preprint retrieved from arXiv:2304.03738.

  • Gallegos, I. O., Rossi, R. A., Barrow, J., Tanjim, M. M., Kim, S., Dernoncourt, F., ... & Ahmed, N. K. (2023). Bias and fairness in large language models: A survey. Preprint retrieved from arXiv:2309.00770.

  • Girotra, K., Meincke, L., Terwiesch, C., & Ulrich, K. T. (2023). Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071.

  • Guo, Y., & Lee, D. (2023). Leveraging ChatGPT for enhancing critical thinking skills. Journal of Chemical Education, 100(12), 4876–4883.

  • Hadi, M. U., Al-Tashi, Q., Qureshi, R., Shah, A., Muneer, A., Irfan, M., Zafar, A., Shaikh, M. B., Akhtar, N., Wu, J., Mirjalili, S., & Shah, M. (2023). Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Preprint retrieved from https://doi.org/10.36227/techrxiv.23589741.v4

  • Hadi, M. U., Qureshi, R., Shah, A., Irfan, M., Zafar, A., Shaikh, M. B., ... & Mirjalili, S. (2023). Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Preprints.

  • Humphreys, P. (2009). The philosophical novelty of computer simulation methods. Synthese, 169, 615–626.

  • Inagaki, T., Kato, A., Takahashi, K., Ozaki, H., & Kanda, G. N. (2023). LLMs can generate robotic scripts from goal-oriented instructions in biological laboratory automation. Preprint retrieved from https://doi.org/10.48550/arXiv.2304.10267

  • Jablonka, K. M., Ai, Q., Al-Feghali, A., Badhwar, S., Bocarsly, J. D., Bran, A. M., Bringuier, S., Brinson, L. C., Choudhary, K., Circi, D., Cox, S., de Jong, W. A., Evans, M. L., Gastellu, N., Genzling, J., Gil, M. V., Gupta, A. K., Hong, Z., Imran, A., ... Blaiszik, B. (2023). 14 examples of how LLMs can transform materials science and chemistry: A reflection on a large language model hackathon. Digital Discovery, 2(5), 1233–1250. https://doi.org/10.1039/d3dd00113j

  • Johri, S., Jeong, J., Tran, B. A., Schlessinger, D. I., Wongvibulsin, S., Cai, Z. R., ... & Rajpurkar, P. (2023). Guidelines for rigorous evaluation of clinical LLMs for conversational reasoning. medRxiv, 2023–09.

  • Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., ... & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274.

  • Kim, J. K., Chua, M., Rickard, M., & Lorenzo, A. (2023). ChatGPT and large language model (LLM) chatbots: The current state of acceptability and a proposal for guidelines on utilization in academic medicine. Journal of Pediatric Urology., 19, 598.

  • Lee, J., Le, T., Chen, J., & Lee, D. (2023). Do language models plagiarize? In Proceedings of the ACM Web Conference 2023 (pp. 3637–3647). ACM. https://doi.org/10.1145/3543507.3583199

  • Li, Y., Du, M., Song, R., Wang, X., & Wang, Y. (2023). A survey on fairness in large language models. Preprint retrieved from arXiv:2308.10149.

  • Liao, Q. V., & Vaughan, J. W. (2023). AI transparency in the age of LLMs: A human-centered research roadmap. Preprint retrieved from arXiv:2306.01941.

  • Lin, Z. (2023). Why and how to embrace AI such as ChatGPT in your academic life. Royal Society Open Science, 10(8), 230658. https://doi.org/10.1098/rsos.230658

  • Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.

  • Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35, 17359–17372.

  • Mishra, A., Soni, U., Arunkumar, A., Huang, J., Kwon, B. C., & Bryan, C. (2023). Promptaid: Prompt exploration, perturbation, testing and iteration using visual analytics for large language models. Preprint retrieved from arXiv:2304.01964.

  • Mittelstadt, B., Wachter, S., & Russell, C. (2023). To protect science, we must use LLMs as zero-shot translators. Nature Human Behaviour, 7(11), 1830–1832.

  • Noy, S., & Zhang, W. (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Available at SSRN 4375283.

  • OpenAI. (2023). GPT-4 technical report. Preprint retrieved from arXiv:2303.08774.

  • Pan, Y., Pan, L., Chen, W., Nakov, P., Kan, M.-Y., & Wang, W. Y. (2023). On the risk of misinformation pollution with large language models. Preprint retrieved from https://doi.org/10.48550/arXiv.2305.13661

  • Qadir, J. (2023). Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education. In 2023 IEEE Global Engineering Education Conference (EDUCON). IEEE.

  • Rakap, S. (2023). Chatting with GPT: Enhancing individualized education program goal development for novice special education teachers. Journal of Special Education Technology. https://doi.org/10.1177/01626434231211295

  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Model-agnostic interpretability of machine learning. Preprint retrieved from arXiv:1606.05386.

  • Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.

  • Salinas, A., Shah, P., Huang, Y., McCormack, R., & Morstatter, F. (2023, October). The Unequal Opportunities of Large Language Models: Examining Demographic Biases in Job Recommendations by ChatGPT and LLaMA. In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (pp. 1–15).

  • Schramowski, P., Turan, C., Andersen, N., & Herbert, F. (2022). Large pre-trained language models contain human-like biases of what is right and wrong to do. Nature Machine Intelligence, 4(3), 258–268. https://doi.org/10.1038/s42256-022-00458-8

  • De Silva, D., Mills, N., El-Ayoubi, M., Manic, M., & Alahakoon, D. (2023). ChatGPT and generative AI guidelines for addressing academic integrity and augmenting pre-existing chatbots. In 2023 IEEE International Conference on Industrial Technology (ICIT) (pp. 1–6). IEEE. https://doi.org/10.1109/ICIT58465.2023.10143123

  • Sun, Z. (2023). A short survey of viewing large language models in legal aspect. Preprint retrieved from arXiv:2303.09136.

  • Valentino, M., & Freitas, A. (2022). Scientific explanation and natural language: A unified epistemological-linguistic perspective for explainable AI. Preprint retrieved from arXiv:2205.01809.

  • Vidgof, M., Bachhofner, S., & Mendling, J. (2023). Large language models for business process management: Opportunities and challenges. Preprint retrieved from https://doi.org/10.48550/arXiv.2304.04309

  • Wang, J., Ma, W., Sun, P., Zhang, M., & Nie, J. Y. (2024a). Understanding user experience in large language model interactions. Preprint retrieved from arXiv:2401.08329.

  • Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W., & Li, J. (2024b). Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. npj Digital Medicine, 7(1), 41.

  • Watkins, R. (2023). Guidance for researchers and peer-reviewers on the ethical use of Large Language Models (LLMs) in scientific research workflows. AI and Ethics. https://doi.org/10.1007/s43681-023-00294-5

  • Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. https://doi.org/10.48550/arXiv.2201.11903

  • Williams, N., Ivanov, S., & Buhalis, D. (2023). Algorithmic ghost in the research shell: Large language models and academic knowledge creation in management research. Preprint retrieved from https://doi.org/10.48550/arXiv.2303.07304

  • Wood, N. G. (2024). Explainable AI in the military domain. Ethics and Information Technology, 26(2), 1–13.

  • Xiao, Z., Yuan, X., Liao, Q. V., Abdelghani, R., & Oudeyer, P.-Y. (2023). Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces (pp. 75–78). ACM. https://doi.org/10.1145/3581754.3584101

  • Yadav, G. (2023). Scaling evidence-based instructional design expertise through large language models. Preprint retrieved from https://doi.org/10.48550/arXiv.2306.01006

  • Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2023). Practical and ethical challenges of large language models in education: A systematic literature review. Preprint retrieved from https://doi.org/10.48550/arXiv.2303.13379

  • Yell, M. M. (2023). Social studies, ChatGPT, and lateral reading. Social Education, 87(3), 138–141.

  • Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., ... & Du, M. (2023). Explainability for large language models: A survey. Preprint retrieved from arXiv:2309.01029.

  • Zolanvari, M., Yang, Z., Khan, K., Jain, R., & Meskin, N. (2021). TRUST XAI: Model-agnostic explanations for AI with a case study on IIoT security. IEEE Internet of Things Journal.

Funding

This work was funded by Fonds Wetenschappelijk Onderzoek (Grant numbers: 1229124N for Kristian González Barman and 1255724N for Pawel Pawlowski) and the Czech Science Foundation (Grant number 24-12638I for Nathan Wood).

Author information

Corresponding author

Correspondence to Kristian González Barman.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: the affiliation and email address of one of the authors were corrected.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Barman, K.G., Wood, N. & Pawlowski, P. Beyond transparency and explainability: on the need for adequate and contextualized user guidelines for LLM use. Ethics Inf Technol 26, 47 (2024). https://doi.org/10.1007/s10676-024-09778-2
