DOI: 10.1145/1277741.1277799

Deconstructing nuggets: the stability and reliability of complex question answering evaluation

Published: 23 July 2007

Abstract

A methodology based on "information nuggets" has recently emerged as the de facto standard by which answers to complex questions are evaluated. After several implementations in the TREC question answering tracks, the community has gained a better understanding of its many characteristics. This paper focuses on one particular aspect of the evaluation: the human assignment of nuggets to answer strings, which serves as the basis of the F-score computation. As a byproduct of the TREC 2006 ciQA task, identical answer strings were independently evaluated twice, which allowed us to assess the consistency of human judgments. Based on these results, we explored simulations of assessor behavior that provide a method to quantify scoring variations. Understanding these variations in turn lets researchers be more confident in their comparisons of systems.
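
To make the evaluation the abstract refers to concrete, the sketch below shows (a) a nugget-style F-score in the spirit of the TREC question answering tracks, where recall is computed over "vital" nuggets and precision is approximated by a length allowance, and (b) a toy simulation that flips individual nugget-to-answer judgments at random and reports how much the score moves. The parameter values (beta, characters per matched nugget, flip probability), the fixed answer length, and the function names are illustrative assumptions for this sketch, not the exact procedure or settings used in the ciQA task or in the paper's simulations.

import random

def nugget_f_score(matched_vital, total_vital, matched_okay, answer_length,
                   beta=3.0, chars_per_nugget=100):
    # Recall counts only vital nuggets; precision is a length-based
    # approximation that penalizes answers longer than the allowance.
    # beta and chars_per_nugget are assumed values for this sketch.
    if total_vital == 0:
        return 0.0
    recall = matched_vital / total_vital
    allowance = chars_per_nugget * (matched_vital + matched_okay)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

def simulate_assessor(judgments, flip_prob=0.1, trials=1000, answer_length=1500):
    # judgments: list of (is_vital, judged_as_matched) pairs for one answer string.
    # Each judgment is flipped independently with probability flip_prob,
    # mimicking a second assessor who sometimes disagrees with the first.
    total_vital = sum(1 for vital, _ in judgments if vital)
    scores = []
    for _ in range(trials):
        matched_vital = matched_okay = 0
        for vital, judged in judgments:
            flipped = (not judged) if random.random() < flip_prob else judged
            if flipped:
                if vital:
                    matched_vital += 1
                else:
                    matched_okay += 1
        scores.append(nugget_f_score(matched_vital, total_vital,
                                     matched_okay, answer_length))
    return min(scores), sum(scores) / len(scores), max(scores)

if __name__ == "__main__":
    # Hypothetical answer: five vital and three okay nuggets, most judged as matched.
    judgments = [(True, True)] * 4 + [(True, False)] + [(False, True)] * 3
    print(simulate_assessor(judgments))

Running the toy example yields the spread of F-scores for a single answer string under a 10% disagreement rate; repeating such a simulation across topics and runs is one simple way to gauge how much assessor variation alone can shift system comparisons.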

References

[1]
J. Allan. HARD track overview in TREC 2005: High accuracy retrieval from documents. In Proceedings of TREC 2005.
[2]
C. Buckley and E. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the SIGIR 2004.
[3]
B. Carterette, J. Allan, and R. Sitaraman. Minimal test collections for retrieval evaluation. In Proceedings SIGIR 2006.
[4]
C. Cleverdon, J. Mills, and E. Keen. Factors determining the performance of indexing systems. Two volumes, ASLIB Cranfield Research Project, Cranfield, England, 1968.
[5]
W. Hildebrandt, B. Katz, and J. Lin. Answering definition questions with multiple knowledge sources. In Proceedings of HLT/NAACL 2004.
[6]
J. Lin. Evaluation of resources for question answering evaluation. In Proceedings SIGIR 2005.
[7]
J. Lin and D. Demner-Fushman. Automatically evaluating answers to definition questions. In Proceedings HLT/EMNLP 2005.
[8]
J. Lin and D. Demner-Fushman. Will pyramids built of nuggets topple over? In Proceedings of HLT/NAACL 2006.
[9]
J. Lin and B. Katz. Building a reusable test collection for question answering. JASIST, 57(7):851--861, 2006.
[10]
M. Sanderson and J. Zobel. Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings SIGIR 2005.
[11]
E. Voorhees. Overview of the TREC 2003 question answering track. In Proceedings of TREC 2003.
[12]
E. Voorhees. Overview of the TREC 2004 question answering track. In Proceedings of TREC 2004.
[13]
E. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. IP&M, 36(5):697--716, 2000.
[14]
E. Voorhees and H. Dang. Overview of the TREC 2005 question answering track. In Proceedings of TREC 2005.
[15]
E. Voorhees and D. Tice. Building a question answering test collection. In Proceedings SIGIR 2000.
[16]
J. Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings SIGIR 1998.

    Published In

    SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
    July 2007
    946 pages
    ISBN:9781595935977
    DOI:10.1145/1277741

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 July 2007

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. complex information needs
    2. human judgments
    3. TREC

    Qualifiers

    • Article

    Conference

    SIGIR07: The 30th Annual International SIGIR Conference
    July 23 - 27, 2007
    Amsterdam, The Netherlands

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2022) Beyond belief: a cross-genre study on perception and validation of health information online. International Journal of Data Science and Analytics, 13(4), 299-314. DOI: 10.1007/s41060-022-00310-7. Online publication date: 2-Feb-2022.
    • (2016) An answerer recommender system exploiting collaboration in CQA services. 2016 IEEE 20th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 198-203. DOI: 10.1109/CSCWD.2016.7565988. Online publication date: May-2016.
    • (2015) Assessor Differences and User Preferences in Tweet Timeline Generation. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 615-624. DOI: 10.1145/2766462.2767699. Online publication date: 9-Aug-2015.
    • (2014) Relevance behaviour in TREC. Journal of Documentation, 70(6), 1098-1117. DOI: 10.1108/JD-02-2014-0031. Online publication date: 7-Oct-2014.
    • (2009) Modeling information-seeker satisfaction in community question answering. ACM Transactions on Knowledge Discovery from Data, 3(2), 1-27. DOI: 10.1145/1514888.1514893. Online publication date: 21-Apr-2009.
    • (2009) Learning Sarawak Local Malay Dialect Using Pedagogical Agent. Proceedings of the 2009 International Conference on Computer Technology and Development - Volume 01, 463-467. DOI: 10.1109/ICCTD.2009.81. Online publication date: 13-Nov-2009.
    • (2008) CoCQA. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 937-946. DOI: 10.5555/1613715.1613836. Online publication date: 25-Oct-2008.
    • (2008) Predicting information seeker satisfaction in community question answering. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 483-490. DOI: 10.1145/1390334.1390417. Online publication date: 20-Jul-2008.
