ABSTRACT
Large language models possess remarkable natural language understanding and generation abilities. However, they often fail to discern fact from fiction, which leads to factually incorrect responses. Open Government Data portals are repositories of, oftentimes linked, information that is freely available to everyone. By combining these two technologies in a proof-of-concept application built on OpenAI's GPT-3.5 model and the Scottish open statistics portal, we show that it is possible to improve the factual accuracy of the large language model's responses, and we propose a novel way to access and retrieve statistical information from the data portal through natural language queries alone. We anticipate that this paper will trigger a discussion on the transformation of Open Government Data portals through large language models.
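The abstract describes grounding GPT-3.5 responses in statistics retrieved from statistics.gov.scot. A minimal sketch of such a retrieval-augmented loop might look as follows; the endpoint URL, the `format=json` parameter, the JSON result shape, and all function names here are illustrative assumptions, since the abstract does not specify the actual pipeline:

```python
# Hedged sketch: answer a natural-language question by first retrieving
# observations from an open-data SPARQL endpoint, then embedding them in
# the prompt so the model answers from portal data, not parametric memory.
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Scottish open statistics portal (assumed).
SPARQL_ENDPOINT = "https://statistics.gov.scot/sparql"


def fetch_observations(sparql_query: str) -> list[dict]:
    """Run a SPARQL query against the portal and return its result bindings."""
    params = urllib.parse.urlencode({"query": sparql_query, "format": "json"})
    with urllib.request.urlopen(f"{SPARQL_ENDPOINT}?{params}") as resp:
        return json.load(resp)["results"]["bindings"]


def build_grounded_prompt(question: str, observations: list[dict]) -> str:
    """Embed the retrieved statistics in the prompt sent to the LLM."""
    facts = "\n".join(json.dumps(obs) for obs in observations)
    return (
        "Answer the question using ONLY the statistics below.\n"
        f"Statistics:\n{facts}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The grounded prompt would then be passed to the chat model; constraining the model to the retrieved observations is what addresses the fact-versus-fiction problem the abstract raises.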
Index Terms
- Can Large Language Models Revolutionalize Open Government Data Portals? A Case of Using ChatGPT in statistics.gov.scot