DOI: 10.1145/3626772.3657652

Machine Generated Explanations and Their Evaluation

Published: 11 July 2024

Abstract

The rapid adoption of a new generation of large language models (LLMs) has demonstrated their considerable capabilities. However, these models are far from infallible, raising significant ethical concerns, especially in decision-making applications, and prompting calls for increased restraint [2].
The Augmented Intelligence paradigm is one proposed mitigation: LLMs become tools used by human decision makers to improve performance without a corresponding loss of accountability.
However, this mitigation imposes requirements on models that are not the primary focus of existing evaluation approaches. In particular, current explanation evaluation approaches tend to prioritize premises and conclusions over reasoning quality. Yet logical soundness is a crucial aspect of system operation, as the output must be interpretable to the user.
This work therefore proposes adopting a technique from programming language theory, wherein intermediate representations are employed to simplify the evaluation of code [3]. Rather than the model mapping directly from queries to solutions, code generation is used to produce an executable intermediate. An effect of this design is to shift the LLM from being a producer of solutions to a creator of delegation plans. Producing a more structured output is expected to ease estimation of model reasoning quality via comparison to golden solutions. Use of a bespoke representation, designed to take advantage of the particulars of automated code generation, aims to reduce the difficulty of this estimation. Use of a novel syntax is, however, made challenging by the absence of existing examples. It is impractical and undesirable, for reasons of both cost and flexibility, to create the large number of examples that would be needed for conventional training.
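As a purely illustrative sketch (the bespoke syntax is still under development and is not reproduced here), a delegation plan for a simple two-hop query might take a form such as the following, where each step is a parameterized tool call and later steps reference earlier results; the step identifiers and tool names are invented for this example:

    # Hypothetical delegation-plan intermediate; the actual bespoke syntax,
    # tool names (SEARCH, CALC), and substitution convention are assumptions
    # made for illustration only.
    plan = [
        {"id": "s1", "tool": "SEARCH", "args": {"query": "year the Hoover Dam was completed"}},
        {"id": "s2", "tool": "SEARCH", "args": {"query": "year the Golden Gate Bridge was completed"}},
        # Aside from control flow, every statement is an external-tool call;
        # {s1} and {s2} mark where earlier tool outputs are substituted on success.
        {"id": "s3", "tool": "CALC", "args": {"expr": "{s2} - {s1}"}},
    ]

Such a structured plan can be compared step by step against a golden plan, rather than comparing free-form natural language answers.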
Early efforts have therefore been concentrated on two main areas: developing a syntax and interpreter, and addressing the challenge of data sparsity. A well-designed syntax is crucial, not only because updates will necessitate revising an increasing number of established solutions, but also because of its expected impact on overall system utility. The expressiveness of the syntax is particularly significant in this context: excessive constraint sacrifices generality, while too much leniency results in a proliferation of semantically equivalent solutions, complicating comparison with the golden solutions. The generated intermediate representation is executed by an interpreter to produce the end solution. Other than a small number of control-flow statements, all statements in the language are parameterized calls to external tools such as retrieval systems or mathematical expression evaluators. External tool usage is the primary motivation for avoiding training on extensive, manually curated examples: the ability to easily add or remove tools is highly desirable. In case of an error, the interpreter output may include explanations of constructs and available tools, error messages, and task-specific metrics. On success, tool output is substituted into the appropriate section of the representation, so that a fully evaluated intermediate contains all information necessary to construct the final natural language explanation. Transformation into this natural language explanation can then be undertaken by another LLM. To address data sparsity, an approach similar to that used by LLM agents in environment exploration is suggested [1]. For each query, a ranking of model generations is produced from the interpreter output in conjunction with a scoring function. Pairs of stronger and weaker responses are then used in a modified form of iterative Direct Preference Optimisation (DPO) [4].
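A minimal sketch of how such an interpreter and the subsequent preference-pair construction might operate is given below; the tool registry, error format, and scoring interface are illustrative assumptions rather than the system's actual implementation:

    import re

    # Illustrative tool registry; real external tools (retrieval, math evaluation)
    # are assumed, and can be added or removed without retraining.
    TOOLS = {
        "SEARCH": lambda query: f"<retrieved passage for: {query}>",
        "CALC": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy evaluator
    }

    def interpret(plan):
        """Execute a delegation plan step by step, substituting earlier outputs.
        Returns (results, None) on success, or (partial results, diagnostics) on
        error, mirroring the interpreter feedback described above."""
        results = {}
        for step in plan:
            try:
                args = {k: re.sub(r"\{(\w+)\}", lambda m: results[m.group(1)], v)
                        for k, v in step["args"].items()}
                results[step["id"]] = TOOLS[step["tool"]](**args)
            except Exception as exc:
                return results, {"failed_step": step["id"], "error": str(exc),
                                 "available_tools": sorted(TOOLS)}
        return results, None

    def preference_pairs(generations, score):
        """Rank candidate plans with a scoring function over interpreter output and
        form (stronger, weaker) pairs for iterative DPO-style updates [4]."""
        ranked = sorted(generations, key=lambda g: score(*interpret(g)), reverse=True)
        return [(ranked[i], ranked[j])
                for i in range(len(ranked)) for j in range(i + 1, len(ranked))]

In this sketch, a fully evaluated plan contains every tool result needed to draft the final natural language explanation, while the diagnostics returned on failure supply the error messages and tool descriptions mentioned above.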
Given this preliminary system, we aim to evaluate it across a range of tasks corresponding to the augmented intelligence paradigm, such as multi-hop question answering.

References

[1]
Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, and Niket Tandon. 2023. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. arXiv:2305.08844 [cs.CL]
[2]
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, et al. 2024. Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems. arXiv preprint arXiv:2401.05778 (2024).
[3]
Jack J. Garzella, Marek Baranowski, Shaobo He, and Zvonimir Rakamarić. 2020. Leveraging compiler intermediate representation for multi- and cross-language verification. In Verification, Model Checking, and Abstract Interpretation: 21st International Conference, VMCAI 2020, New Orleans, LA, USA, January 16-21, 2020, Proceedings 21. Springer, 90--111.
[4]
Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. 2023. Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss. arXiv:2312.16682 [cs.CL]

Published In

SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2024
3164 pages
ISBN:9798400704314
DOI:10.1145/3626772
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. code generation
  2. large language models
  3. program verification

