DOI: 10.1145/3626772.3657652

Machine Generated Explanations and Their Evaluation

Published: 11 July 2024

Abstract

The rapid adoption of a new generation of large language models (LLMs) has demonstrated their considerable capabilities. However, these models are far from infallible, raising significant ethical concerns, especially in decision-making applications, and prompting calls for increased restraint [2].
The Augmented Intelligence paradigm is one proposed mitigation: LLMs become tools used by human decision makers to improve performance without a corresponding loss of accountability.
However, this mitigation imposes requirements on models that are not the primary focus of existing evaluation approaches. In particular, current explanation evaluation approaches tend to prioritize premises and conclusions over reasoning quality. Yet logical soundness is a crucial aspect of system operation, as the output must be interpretable to the user.
This work therefore proposes adopting a technique from programming language theory, wherein intermediate representations are employed to simplify the evaluation of code [3]. Rather than the model mapping directly from queries to solutions, code generation is used to produce an executable intermediate. An effect of this design is to shift the LLM from being a producer of solutions to a creator of delegation plans. Producing a more structured output is expected to ease estimation of model reasoning quality via comparison to golden solutions. Use of a bespoke representation, designed to take advantage of the particulars of automated code generation, aims to reduce the difficulty of this estimation. Use of a novel syntax is, however, made challenging by the absence of existing examples. It is impractical and undesirable, for reasons of both cost and flexibility, to create the large number of examples that would be needed for conventional training.
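As a purely illustrative sketch (the bespoke syntax is still under development and is not reproduced here), a delegation plan for a simple two-hop query might take a form such as the following, where each step is a parameterized tool call and later steps reference earlier results; the step identifiers and tool names are invented for this example:

    # Hypothetical delegation-plan intermediate; the actual bespoke syntax,
    # tool names (SEARCH, CALC), and substitution convention are assumptions
    # made for illustration only.
    plan = [
        {"id": "s1", "tool": "SEARCH", "args": {"query": "year the Hoover Dam was completed"}},
        {"id": "s2", "tool": "SEARCH", "args": {"query": "year the Golden Gate Bridge was completed"}},
        # Aside from control flow, every statement is an external-tool call;
        # {s1} and {s2} mark where earlier tool outputs are substituted on success.
        {"id": "s3", "tool": "CALC", "args": {"expr": "{s2} - {s1}"}},
    ]

Such a structured plan can be compared step by step against a golden plan, rather than comparing free-form natural language answers.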
Early efforts have therefore been concentrated on two main areas: developing a syntax and interpreter, and addressing the challenge of data sparsity. A well-designed syntax is crucial, not only because updates will necessitate revising an increasing number of established solutions, but also because of its expected impact on overall system utility. The expressiveness of the syntax is particularly significant in this context: excessive constraint sacrifices generality, while too much leniency results in a proliferation of semantically equivalent solutions, complicating comparison with the golden solutions. The generated intermediate representation is executed by an interpreter to produce the end solution. Other than a small number of control-flow statements, all statements in the language are parameterized calls to external tools such as retrieval systems or mathematical expression evaluators. External tool usage is the primary motivation for avoiding training on extensive, manually curated examples: the ability to easily add or remove tools is highly desirable. In case of an error, the interpreter output may include explanations of constructs and available tools, error messages, and task-specific metrics. On success, tool output is substituted into the appropriate section of the representation, so that a fully evaluated intermediate contains all information necessary to construct the final natural language explanation. Transformation into this natural language explanation can then be undertaken by another LLM. To address data sparsity, an approach similar to that used by LLM agents in environment exploration is suggested [1]. For each query, a ranking of model generations is produced from the interpreter output in conjunction with a scoring function. Pairs of stronger and weaker responses are then used in a modified form of iterative Direct Preference Optimisation (DPO) [4].
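A minimal sketch of how such an interpreter and the subsequent preference-pair construction might operate is given below; the tool registry, error format, and scoring interface are illustrative assumptions rather than the system's actual implementation:

    import re

    # Illustrative tool registry; real external tools (retrieval, math evaluation)
    # are assumed, and can be added or removed without retraining.
    TOOLS = {
        "SEARCH": lambda query: f"<retrieved passage for: {query}>",
        "CALC": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy evaluator
    }

    def interpret(plan):
        """Execute a delegation plan step by step, substituting earlier outputs.
        Returns (results, None) on success, or (partial results, diagnostics) on
        error, mirroring the interpreter feedback described above."""
        results = {}
        for step in plan:
            try:
                args = {k: re.sub(r"\{(\w+)\}", lambda m: results[m.group(1)], v)
                        for k, v in step["args"].items()}
                results[step["id"]] = TOOLS[step["tool"]](**args)
            except Exception as exc:
                return results, {"failed_step": step["id"], "error": str(exc),
                                 "available_tools": sorted(TOOLS)}
        return results, None

    def preference_pairs(generations, score):
        """Rank candidate plans with a scoring function over interpreter output and
        form (stronger, weaker) pairs for iterative DPO-style updates [4]."""
        ranked = sorted(generations, key=lambda g: score(*interpret(g)), reverse=True)
        return [(ranked[i], ranked[j])
                for i in range(len(ranked)) for j in range(i + 1, len(ranked))]

In this sketch, a fully evaluated plan contains every tool result needed to draft the final natural language explanation, while the diagnostics returned on failure supply the error messages and tool descriptions mentioned above.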
Given this preliminary system, we aim to evaluate it across a range of tasks corresponding to the augmented intelligence paradigm, such as multi-hop question answering.

References

[1]
Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, and Niket Tandon. 2023. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. arXiv:2305.08844 [cs.CL]
[2]
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, et al. 2024. Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems. arXiv preprint arXiv:2401.05778 (2024).
[3]
Jack J. Garzella, Marek Baranowski, Shaobo He, and Zvonimir Rakamarić. 2020. Leveraging compiler intermediate representation for multi- and cross-language verification. In Verification, Model Checking, and Abstract Interpretation: 21st International Conference, VMCAI 2020, New Orleans, LA, USA, January 16-21, 2020, Proceedings 21. Springer, 90--111.
[4]
Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. 2023. Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss. arXiv:2312.16682 [cs.CL]

Published In

SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2024
3164 pages
ISBN:9798400704314
DOI:10.1145/3626772
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. code generation
  2. large language models
  3. program verification

