Abstract
Rocchio's similarity-based relevance feedback algorithm, one of the most important query reformulation methods in information retrieval, is essentially an adaptive supervised algorithm that learns from examples. In spite of its popularity in various applications, there is little rigorous analysis of its learning complexity in the literature. In this paper we show that in the binary vector space model, if the initial query vector is 0, then for any of the four typical similarities (inner product, Dice coefficient, cosine coefficient, and Jaccard coefficient), Rocchio's similarity-based relevance feedback algorithm makes at least n mistakes when used to search for a collection of documents represented by a monotone disjunction of at most k relevant features (or terms) over the n-dimensional binary vector space {0, 1}^n. When an arbitrary initial query vector in {0, 1}^n is used, it makes at least (n + k − 3)/2 mistakes in searching for the same collection of documents. These linear lower bounds are independent of the choices of the threshold and coefficients that the algorithm may use in updating its query vector and making its classification.
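The setting described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formal definition: the update rule (add or subtract alpha times the misclassified document), the fixed threshold theta, the handling of zero denominators, and all function and parameter names are assumptions made for illustration. The four similarities follow their usual vector-space forms.

```python
import math

def inner(q, d):
    # Inner product of query vector q and document vector d.
    return sum(qi * di for qi, di in zip(q, d))

def dice(q, d):
    # Dice coefficient: 2 q.d / (|q|^2 + |d|^2); 0 if both vectors are zero.
    denom = inner(q, q) + inner(d, d)
    return 2 * inner(q, d) / denom if denom else 0.0

def cosine(q, d):
    # Cosine coefficient: q.d / (|q| |d|); 0 if either vector is zero.
    denom = math.sqrt(inner(q, q) * inner(d, d))
    return inner(q, d) / denom if denom else 0.0

def jaccard(q, d):
    # Jaccard coefficient: q.d / (|q|^2 + |d|^2 - q.d); 0 if both are zero.
    denom = inner(q, q) + inner(d, d) - inner(q, d)
    return inner(q, d) / denom if denom else 0.0

def is_relevant(d, relevant):
    # Target concept: a monotone disjunction over the relevant feature indices.
    return any(d[i] for i in relevant)

def run_feedback(docs, relevant, sim=inner, theta=0.5, alpha=1.0, q0=None):
    # Assumed Rocchio-style loop: classify each document by thresholding the
    # similarity, and on a mistake move the query toward (or away from) it.
    n = len(docs[0])
    q = list(q0) if q0 is not None else [0.0] * n
    mistakes = 0
    for d in docs:
        predicted = sim(q, d) >= theta
        actual = is_relevant(d, relevant)
        if predicted != actual:
            mistakes += 1
            sign = 1.0 if actual else -1.0
            q = [qi + sign * alpha * di for qi, di in zip(q, d)]
    return q, mistakes

# Usage: four unit-vector documents over {0, 1}^4, relevant features 0 and 2.
docs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
q, mistakes = run_feedback(docs, relevant={0, 2})
print(q, mistakes)
```

The sketch only counts mistakes on one particular document sequence; it does not construct the adversarial sequences behind the lower bounds stated above.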
Cite this article
Chen, Z., Zhu, B. Some Formal Analysis of Rocchio's Similarity-Based Relevance Feedback Algorithm. Information Retrieval 5, 61–86 (2002). https://doi.org/10.1023/A:1012730924277