
Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Conference paper

Information Retrieval Technology (AIRS 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9460)


Abstract

Short Text Conversation (STC) is a new NTCIR task which tackles the following research question: given a microblog repository and a new post to that microblog, can systems reuse an old comment from the repository to satisfy the author of the new post? The official evaluation measures of STC are normalised gain at 1 (nG@1), normalised expected reciprocal rank at 10 (nERR@10), and P\(^+\), all of which can be regarded as evaluation measures for navigational intents. In this study, we apply the topic set size design technique of Sakai to decide on the number of test topics, using variance estimates of the above evaluation measures. Our main conclusion is to create 100 test topics, but what distinguishes our work from other tasks with similar topic set sizes is that we know what this topic set size means from a statistical viewpoint for each of our evaluation measures. We also demonstrate that, under the same set of statistical requirements, the topic set sizes required by nERR@10 and P\(^+\) are more or less the same, while nG@1 requires more than twice as many topics. To our knowledge, our task is the first among all efforts at TREC-like evaluation conferences to actually create a new test collection by using this principled approach.
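
The "topic set size design" mentioned above is, at heart, a statistical power calculation: given a variance estimate for an evaluation measure, how many topics make a score difference of practical interest statistically detectable? As a hedged illustration only, the sketch below runs a paired t-test power analysis with statsmodels; this is a simplified stand-in for, not a reproduction of, the ANOVA-based procedure of [14] (Note 11 links to the actual tool), and the variance and difference values are invented.

```python
# Hedged sketch: how many topics are needed so that a paired t-test can
# detect a per-topic score difference of `min_diff` between two systems,
# at significance level `alpha` and with statistical power `power`?
import math
from statsmodels.stats.power import TTestPower

sigma2_hat = 0.08   # invented pilot variance estimate for the measure
min_diff = 0.10     # smallest score difference worth detecting (invented)
alpha, power = 0.05, 0.80

effect_size = min_diff / math.sqrt(sigma2_hat)   # standardised difference
n_topics = TTestPower().solve_power(effect_size=effect_size, alpha=alpha,
                                    power=power, alternative="two-sided")
print(math.ceil(n_topics))   # required topic set size
```

A noisier measure (a larger variance estimate) pushes the required topic count up, which is consistent with the observation above that nG@1 requires more than twice as many topics as nERR@10 or P\(^+\).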


Notes

  1. http://ntcir12.noahlab.com.hk/stc.htm.

  2. http://research.nii.ac.jp/ntcir/.

  3. Examples taken from our arXiv paper: http://arxiv.org/pdf/1408.6988.pdf.

  4. http://weibo.com.

  5. A Japanese subtask using Twitter data is also in preparation.

  6. http://twitter.com.

  7. The minimum/average/maximum lengths of the 196,395 posts in the repository are 10/32.5/140 characters, respectively. In contrast, after machine-translating them into English, the corresponding lengths are 11/115.7/724. This suggests that a Chinese tweet can be 3–5 times as informative as an English one.

  8. While the present study uses the post-comment labels collected as described in the arXiv paper, we have since revised the labelling criteria in order to clarify several different axes for labelling, including coherence and usefulness. The new labelling scheme will be used to revise the training data labels as well as to construct the official test data labels.

  9. nG@1 is sometimes referred to as nDCG@1; however, note that neither discounting (“D”) nor cumulation of gains (“C”) is applied at rank 1 (see the sketch after these notes).

  10. Given an input remark “Men are all alike,” ELIZA, the rule-based system developed in the 1960s, could respond: “IN WHAT WAY?” [21].

  11. http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx.

  12. http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html.

  13. Note that Average Precision and Q-measure assume a uniform distribution over all relevant documents, so that the stopping probability at each relevant document is \(1/R\), where \(R\) is the total number of relevant documents [16] (see the sketch after these notes).

  14. http://research.nii.ac.jp/ntcir/tools/discpower-en.html.

  15. The effect size here is essentially the difference between a system pair as measured in standard deviation units, after removing the between-system and between-topic effects (see the equation after these notes).

  16. When \(m=2\), one-way ANOVA is equivalent to the unpaired t-test (see the check after these notes).
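
The distinction drawn in Note 9 is easy to state in code. The following is a minimal, hypothetical sketch of nG@1 under graded relevance, with invented gain values; the function is ours and is not part of the NTCIREVAL tool linked in Note 12.

```python
# Hedged sketch of nG@1 (Note 9): the gain of the run's top-ranked item,
# normalised by the best single gain an ideal ranking could place at
# rank 1. No discounting ("D") and no cumulation ("C") are involved.
def ng_at_1(run_gains, pool_gains):
    """run_gains: gain values of the run's ranked list, top first.
    pool_gains: gain values of all judged items for the topic."""
    ideal = max(pool_gains)          # gain of the single best item
    return run_gains[0] / ideal if ideal > 0 else 0.0

print(ng_at_1([1, 3, 0], [3, 3, 1, 0]))  # 1/3: top item only partially relevant
```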
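
The user model in Note 13 can be made concrete in the same way. Under the assumption of binary relevance, Average Precision is the expected precision at a stopping point drawn uniformly, with probability \(1/R\), from the \(R\) relevant documents; a hedged sketch:

```python
# Hedged sketch of Average Precision (Note 13) as an expectation over
# stopping points: each relevant document is a stopping point with
# probability 1/R, and the user's payoff there is Prec@rank.
def average_precision(is_rel, R):
    """is_rel: binary relevance flags of the ranked list, top first.
    R: total number of relevant documents for the topic."""
    ap, rel_seen = 0.0, 0
    for rank, rel in enumerate(is_rel, start=1):
        if rel:
            rel_seen += 1
            ap += (rel_seen / rank) / R   # Prec@rank, weighted by 1/R
    return ap

print(average_precision([1, 0, 1, 0], R=3))  # (1/1 + 2/3) / 3 ≈ 0.556
```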
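
The effect size of Note 15 can also be written out. The reconstruction below assumes the two-way ANOVA (without replication) convention of Sakai [12]: with \(m\) systems, \(n\) topics, and \(x_{ij}\) the score of system \(i\) on topic \(j\),

\[ \hat{\sigma}^2 = \frac{1}{(m-1)(n-1)} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x} \right)^2, \qquad \mathrm{ES}(A, B) = \frac{\bar{x}_{A\cdot} - \bar{x}_{B\cdot}}{\hat{\sigma}}, \]

where \(\bar{x}_{i\cdot}\), \(\bar{x}_{\cdot j}\), and \(\bar{x}\) are the system, topic, and grand means; subtracting the system and topic means is precisely the removal of the between-system and between-topic effects.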
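
Finally, the equivalence in Note 16 is easy to verify numerically: with \(m=2\) groups, the one-way ANOVA \(F\) statistic is the square of the unpaired \(t\) statistic, and the two p-values coincide. A minimal check with invented scores:

```python
# Hedged check for Note 16: one-way ANOVA with two groups is the
# unpaired t-test (F = t^2, identical p-values).
from scipy.stats import f_oneway, ttest_ind

a = [0.42, 0.55, 0.31, 0.48, 0.60]   # invented per-topic scores, system A
b = [0.38, 0.47, 0.29, 0.52, 0.41]   # invented per-topic scores, system B

F, p_anova = f_oneway(a, b)
t, p_ttest = ttest_ind(a, b)          # unpaired two-sample t-test
print(abs(F - t ** 2) < 1e-9, abs(p_anova - p_ttest) < 1e-9)  # True True
```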

References

  1. Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)

  2. Chapelle, O., Ji, S., Liao, C., Velipasaoglu, E., Lai, L., Wu, S.L.: Intent-based diversification of web search results: metrics and algorithms. Inf. Retrieval 14(6), 572–592 (2011)

  3. Ellis, P.D.: The Essential Guide to Effect Sizes. Cambridge University Press, New York (2010)

  4. Higashinaka, R., Kawamae, N., Sadamitsu, K., Minami, Y., Meguro, T., Dohsaka, K., Inagaki, H.: Building a conversational model from two-tweets. In: Proceedings of IEEE ASRU 2011 (2011)

  5. Jafarpour, S., Burges, C.J.: Filter, rank and transfer the knowledge: learning to chat. Technical report MSR-TR-2010-93 (2010)

  6. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)

  7. Lin, J., Efron, M.: Overview of the TREC-2013 microblog track. In: Proceedings of TREC 2013 (2014)

  8. Mitamura, T., Shima, H., Sakai, T., Kando, N., Mori, T., Takeda, K., Lin, C.Y., Song, R., Lin, C.J., Lee, C.W.: Overview of the NTCIR-8 ACLIA tasks: advanced cross-lingual information access. In: Proceedings of NTCIR-8, pp. 15–24 (2010)

  9. Nagata, Y.: How to Design the Sample Size (in Japanese). Asakura Shoten (2003)

  10. Ritter, A., Cherry, C., Dolan, W.B.: Data-driven response generation in social media. In: Proceedings of EMNLP 2011, pp. 583–593 (2011)

  11. Sakai, T.: Bootstrap-based comparisons of IR metrics for finding one relevant document. In: Ng, H.T., Leong, M.-K., Kan, M.-Y., Ji, D. (eds.) AIRS 2006. LNCS, vol. 4182, pp. 374–389. Springer, Heidelberg (2006)

  12. Sakai, T.: Statistical reform in information retrieval? SIGIR Forum 48(1), 3–12 (2014)

  13. Sakai, T.: Information Access Evaluation Methodology: For the Progress of Search Engines (in Japanese). Coronasha (2015)

  14. Sakai, T.: Topic set size design. Information Retrieval Journal (submitted) (2015)

  15. Sakai, T., Ishikawa, D., Kando, N., Seki, Y., Kuriyama, K., Lin, C.Y.: Using graded-relevance metrics for evaluating community QA answer selection. In: Proceedings of ACM WSDM 2011, pp. 187–196 (2011)

  16. Sakai, T., Robertson, S.: Modelling a user population for designing information retrieval metrics. In: Proceedings of EVIA 2008, pp. 30–41 (2008)

  17. Sanderson, M., Zobel, J.: Information retrieval system evaluation: effort, sensitivity, and reliability. In: Proceedings of ACM SIGIR 2005, pp. 162–169 (2005)

  18. Shibuki, H., Sakamoto, K., Kano, Y., Mitamura, T., Ishioroshi, M., Itakura, K., Wang, D., Mori, T., Kando, N.: Overview of the NTCIR-11 QA-Lab task. In: Proceedings of NTCIR-11, pp. 518–529 (2014)

  19. Voorhees, E.M., Harman, D.K. (eds.): TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, Cambridge (2005)

  20. Webber, W., Moffat, A., Zobel, J.: Statistical power in retrieval experimentation. In: Proceedings of ACM CIKM 2008, pp. 571–580 (2008)

  21. Weizenbaum, J.: ELIZA - a computer program for the study of natural language communication between man and machine. Commun. ACM 9(1), 36–45 (1966)


Author information

Correspondence to Tetsuya Sakai.



Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Sakai, T., Shang, L., Lu, Z., Li, H. (2015). Topic Set Size Design with the Evaluation Measures for Short Text Conversation. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_25


  • DOI: https://doi.org/10.1007/978-3-319-28940-3_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28939-7

  • Online ISBN: 978-3-319-28940-3

  • eBook Packages: Computer Science, Computer Science (R0)
