DOI: 10.1145/1277741.1277799

Deconstructing nuggets: the stability and reliability of complex question answering evaluation

Published: 23 July 2007

Abstract

A methodology based on "information nuggets" has recently emerged as the de facto standard by which answers to complex questions are evaluated. After several implementations in the TREC question answering tracks, the community has gained a better understanding of its many characteristics. This paper focuses on one particular aspect of the evaluation: the human assignment of nuggets to answer strings, which serves as the basis of the F-score computation. As a byproduct of the TREC 2006 ciQA task, identical answer strings were independently evaluated twice, which allowed us to assess the consistency of human judgments. Based on these results, we explored simulations of assessor behavior that provide a method to quantify scoring variations. Understanding these variations in turn lets researchers be more confident in their comparisons of systems.
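
To make the evaluation the abstract refers to concrete, the sketch below shows (a) a nugget-style F-score in the spirit of the TREC question answering tracks, where recall is computed over "vital" nuggets and precision is approximated by a length allowance, and (b) a toy simulation that flips individual nugget-to-answer judgments at random and reports how much the score moves. The parameter values (beta, characters per matched nugget, flip probability), the fixed answer length, and the function names are illustrative assumptions for this sketch, not the exact procedure or settings used in the ciQA task or in the paper's simulations.

import random

def nugget_f_score(matched_vital, total_vital, matched_okay, answer_length,
                   beta=3.0, chars_per_nugget=100):
    # Recall counts only vital nuggets; precision is a length-based
    # approximation that penalizes answers longer than the allowance.
    # beta and chars_per_nugget are assumed values for this sketch.
    if total_vital == 0:
        return 0.0
    recall = matched_vital / total_vital
    allowance = chars_per_nugget * (matched_vital + matched_okay)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

def simulate_assessor(judgments, flip_prob=0.1, trials=1000, answer_length=1500):
    # judgments: list of (is_vital, judged_as_matched) pairs for one answer string.
    # Each judgment is flipped independently with probability flip_prob,
    # mimicking a second assessor who sometimes disagrees with the first.
    total_vital = sum(1 for vital, _ in judgments if vital)
    scores = []
    for _ in range(trials):
        matched_vital = matched_okay = 0
        for vital, judged in judgments:
            flipped = (not judged) if random.random() < flip_prob else judged
            if flipped:
                if vital:
                    matched_vital += 1
                else:
                    matched_okay += 1
        scores.append(nugget_f_score(matched_vital, total_vital,
                                     matched_okay, answer_length))
    return min(scores), sum(scores) / len(scores), max(scores)

if __name__ == "__main__":
    # Hypothetical answer: five vital and three okay nuggets, most judged as matched.
    judgments = [(True, True)] * 4 + [(True, False)] + [(False, True)] * 3
    print(simulate_assessor(judgments))

Running the toy example yields the spread of F-scores for a single answer string under a 10% disagreement rate; repeating such a simulation across topics and runs is one simple way to gauge how much assessor variation alone can shift system comparisons.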

References

[1]
J. Allan. HARD track overview in TREC 2005: High accuracy retrieval from documents. In Proceedings of TREC 2005.
[2]
C. Buckley and E. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the SIGIR 2004.
[3]
B. Carterette, J. Allan, and R. Sitaraman. Minimal test collections for retrieval evaluation. In Proceedings SIGIR 2006.
[4]
C. Cleverdon, J. Mills, and E. Keen. Factors determining the performance of indexing systems. Two volumes, ASLIB Cranfield Research Project, Cranfield, England, 1968.
[5]
W. Hildebrandt, B. Katz, and J. Lin. Answering definition questions with multiple knowledge sources. In Proceedings of HLT/NAACL 2004.
[6]
J. Lin. Evaluation of resources for question answering evaluation. In Proceedings SIGIR 2005.
[7]
J. Lin and D. Demner-Fushman. Automatically evaluating answers to definition questions. In Proceedings HLT/EMNLP 2005.
[8]
J. Lin and D. Demner-Fushman. Will pyramids built of nuggets topple over? In Proceedings of HLT/NAACL 2006.
[9]
J. Lin and B. Katz. Building a reusable test collection for question answering. JASIST, 57(7):851--861, 2006.
[10]
M. Sanderson and J. Zobel. Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings SIGIR 2005.
[11]
E. Voorhees. Overview of the TREC 2003 question answering track. In Proceedings of TREC 2003.
[12]
E. Voorhees. Overview of the TREC 2004 question answering track. In Proceedings of TREC 2004.
[13]
E. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. IP&M, 36(5):697--716, 2000.
[14]
E. Voorhees and H. Dang. Overview of the TREC 2005 question answering track. In Proceedings of TREC 2005.
[15]
E. Voorhees and D. Tice. Building a question answering test collection. In Proceedings SIGIR 2000.
[16]
J. Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings SIGIR 1998.

    Published In

    SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
    July 2007
    946 pages
    ISBN:9781595935977
    DOI:10.1145/1277741

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 July 2007

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. complex information needs
    2. human judgments
    3. TREC

    Qualifiers

    • Article

    Conference

    SIGIR07: The 30th Annual International SIGIR Conference
    July 23 - 27, 2007
    Amsterdam, The Netherlands

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2022) Beyond belief: a cross-genre study on perception and validation of health information online. International Journal of Data Science and Analytics, 13(4), 299-314. DOI: 10.1007/s41060-022-00310-7. Online publication date: 2-Feb-2022.
    • (2016) An answerer recommender system exploiting collaboration in CQA services. 2016 IEEE 20th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 198-203. DOI: 10.1109/CSCWD.2016.7565988. Online publication date: May-2016.
    • (2015) Assessor Differences and User Preferences in Tweet Timeline Generation. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 615-624. DOI: 10.1145/2766462.2767699. Online publication date: 9-Aug-2015.
    • (2014) Relevance behaviour in TREC. Journal of Documentation, 70(6), 1098-1117. DOI: 10.1108/JD-02-2014-0031. Online publication date: 7-Oct-2014.
    • (2009) Modeling information-seeker satisfaction in community question answering. ACM Transactions on Knowledge Discovery from Data, 3(2), 1-27. DOI: 10.1145/1514888.1514893. Online publication date: 21-Apr-2009.
    • (2009) Learning Sarawak Local Malay Dialect Using Pedagogical Agent. Proceedings of the 2009 International Conference on Computer Technology and Development - Volume 01, 463-467. DOI: 10.1109/ICCTD.2009.81. Online publication date: 13-Nov-2009.
    • (2008) CoCQA. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 937-946. DOI: 10.5555/1613715.1613836. Online publication date: 25-Oct-2008.
    • (2008) Predicting information seeker satisfaction in community question answering. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 483-490. DOI: 10.1145/1390334.1390417. Online publication date: 20-Jul-2008.
