
Evaluation of resources for question answering evaluation

Published: 15 August 2005

Abstract

Controlled and reproducible laboratory experiments, enabled by reusable test collections, represent a well-established methodology in modern information retrieval research. In order to confidently draw conclusions about the performance of different retrieval methods using test collections, their reliability and trustworthiness must first be established. Although such studies have been performed for ad hoc test collections, currently available resources for evaluating question answering systems have not been similarly analyzed. This study evaluates the quality of answer patterns and lists of relevant documents currently employed in automatic question answering evaluation, and concludes that they are not suitable for post-hoc experimentation. These resources, created from runs submitted by TREC QA track participants, do not produce fair and reliable assessments of systems that did not participate in the original evaluations. Potential solutions for addressing this evaluation gap and their shortcomings are discussed.


Published In

SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2005, 708 pages
ISBN: 1595930345
DOI: 10.1145/1076034

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. pooling
    2. question answering

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

