Comparative Analysis of Relevance for SVM-Based Interactive Document Retrieval

Hiroshi Murata; Takashi Onoda; Seiji Yamada

doi:10.20965/jaciii.2013.p0149

JACIII Vol.17 No.2 pp. 149-156 (2013)

Paper:

Comparative Analysis of Relevance for SVM-Based Interactive Document Retrieval

Hiroshi Murata*, Takashi Onoda*, and Seiji Yamada**

*Central Research Institute of Electric Power Industry (CRIEPI), 2-11-1 Iwado Kita, Komae-shi, Tokyo 201-8511, Japan

**National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

Received: July 26, 2012

Accepted: December 20, 2012

Published: March 20, 2013

Keywords: interactive document retrieval, support vector machines, relevance feedback, kernel method

Abstract

Support Vector Machines (SVMs) were applied to interactive document retrieval that uses active learning. In such a retrieval system, the degree of relevance is evaluated using the signed distance from the optimal hyperplane. It is not clear, however, how this signed distance relates to the characteristics of the vector space model. We therefore formulated the degree of relevance in terms of the signed distance in SVMs and compared it analytically with a conventional Rocchio-based method. Although vector normalization has been used as preprocessing for document retrieval, few studies have explained why it is effective. Based on our comparative analysis, we show theoretically that normalizing document vectors is effective in SVM-based interactive document retrieval. We then propose a cosine kernel suited to SVM-based interactive document retrieval. The effectiveness of the method was compared experimentally with conventional relevance feedback for Boolean, Term Frequency (TF), and Term Frequency-Inverse Document Frequency (TF-IDF) representations of document vectors. Experimental results on a Text REtrieval Conference (TREC) data set showed that the cosine kernel is effective for all document representations, especially the Term Frequency representation.
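The two quantities the abstract relies on, the signed distance from a hyperplane and a cosine kernel, can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulation, so this sketch assumes the standard cosine-similarity definition K(u, v) = u·v / (‖u‖‖v‖) and a linear hyperplane (w, b). The function names and toy vectors are hypothetical and serve only as illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)

def cosine_kernel(u, v):
    """Assumed definition: cosine similarity, K(u, v) = u.v / (|u| |v|)."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom > 0 else 0.0

def signed_distance(w, b, x):
    """Signed distance of x from the hyperplane w.x + b = 0 (w must be nonzero)."""
    return (dot(w, x) + b) / math.sqrt(dot(w, w))

# Toy term-frequency vectors (hypothetical): same term proportions, different lengths.
doc_short = [2.0, 1.0, 0.0]
doc_long = [20.0, 10.0, 0.0]
other_doc = [1.0, 1.0, 1.0]

# The cosine kernel ignores document length ...
k_short = cosine_kernel(doc_short, other_doc)
k_long = cosine_kernel(doc_long, other_doc)

# ... and equals a plain linear kernel applied to L2-normalized vectors.
k_linear_on_normalized = dot(l2_normalize(doc_short), l2_normalize(other_doc))

print(k_short, k_long, k_linear_on_normalized)
```

Under this assumed definition, a TF vector and a scaled copy of it receive the same kernel value, which gives one intuition for why length normalization matters most for raw term-frequency representations.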

Cite this article as:

H. Murata, T. Onoda, and S. Yamada, “Comparative Analysis of Relevance for SVM-Based Interactive Document Retrieval,” J. Adv. Comput. Intell. Intell. Inform., Vol.17 No.2, pp. 149-156, 2013.


This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
