
Evaluation of resources for question answering evaluation

Published: 15 August 2005

Abstract

Controlled and reproducible laboratory experiments, enabled by reusable test collections, represent a well-established methodology in modern information retrieval research. In order to confidently draw conclusions about the performance of different retrieval methods using test collections, their reliability and trustworthiness must first be established. Although such studies have been performed for ad hoc test collections, currently available resources for evaluating question answering systems have not been similarly analyzed. This study evaluates the quality of answer patterns and lists of relevant documents currently employed in automatic question answering evaluation, and concludes that they are not suitable for post-hoc experimentation. These resources, created from runs submitted by TREC QA track participants, do not produce fair and reliable assessments of systems that did not participate in the original evaluations. Potential solutions for addressing this evaluation gap and their shortcomings are discussed.


Published In

SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2005, 708 pages
ISBN: 1595930345
DOI: 10.1145/1076034

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. pooling
    2. question answering

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

