Abstract
Recommender systems are now popular both commercially and in the research community, where many approaches have been suggested for providing recommendations. In many cases a system designer who wishes to employ a recommender system must choose among a set of candidate approaches. A first step towards selecting an appropriate algorithm is to decide which properties of the application to focus on when making this choice. Indeed, recommender systems have a variety of properties that may affect user experience, such as accuracy, robustness, and scalability. In this chapter we discuss how to compare recommenders based on a set of properties that are relevant for the application. We focus on comparative studies, where a few algorithms are compared using some evaluation metric, rather than on absolute benchmarking of algorithms. We describe experimental settings appropriate for making choices between algorithms. We review three types of experiments: offline experiments, where recommendation approaches are compared without user interaction; user studies, where a small group of subjects experiment with the system and report on the experience; and large-scale online experiments, where real user populations interact with the system. In each of these cases we describe the types of questions that can be answered and suggest protocols for experimentation. We also discuss how to draw trustworthy conclusions from the conducted experiments. We then review a large set of properties and explain how to evaluate systems given the relevant properties. We also survey a large set of evaluation metrics in the context of the properties that they evaluate.
Notes
- 4. A reference to their origins in signal detection theory.
- 5. Not to be confused with trust in social network research, which measures how much one user believes another. Some literature on recommender systems uses such trust measurements to filter similar users [64].
References
R. Bailey, Design of Comparative Experiments, vol. 25 (Cambridge University Press, Cambridge, 2008)
D. Bamber, The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psychol. 12, 387–415 (1975)
J. Beel, S. Langer, A comparison of offline evaluations, online evaluations, and user studies in the context of research-paper recommender systems, in International Conference on Theory and Practice of Digital Libraries (Springer, New York, 2015), pp. 153–168
Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodological) 57, 289–300 (1995)
P.J. Bickel, K.A. Doksum, Mathematical Statistics: Ideas and Concepts (Holden-Day, San Francisco, 1977)
M. Boland, Native ads will drive 74% of all ad revenue by 2021. Business Insider 14, 2016
P. Bonhard, C. Harries, J. McCarthy, M.A. Sasse, Accounting for taste: using profile similarity to improve recommender systems, in CHI ’06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, 2006 (ACM, New York, 2006), pp. 1057–1066
C. Boutilier, R.S. Zemel, Online queries for collaborative filtering, in Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2002
G.E.P. Box, W.G. Hunter, J.S. Hunter, Statistics for Experimenters (Wiley, New York, 1978)
K. Bradley, B. Smyth, Improving recommendation diversity, in Twelfth Irish Conference on Artificial Intelligence and Cognitive Science (2001), pp. 85–94
D. Braziunas, C. Boutilier, Local utility elicitation in GAI models, in Proceedings of the Twenty-first Conference on Uncertainty in Artificial Intelligence, Edinburgh, 2005, pp. 42–49
J.S. Breese, D. Heckerman, C.M. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in UAI, 1998
R. Burke, Evaluating the dynamic properties of recommendation algorithms, in Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, New York (ACM, New York, 2010), pp. 225–228
Ò. Celma, P. Herrera, A new approach to evaluating novel recommendations, in RecSys ’08: Proceedings of the 2008 ACM Conference on Recommender systems, New York, NY (ACM, New York, 2008), pp. 179–186
P.-A. Chirita, W. Nejdl, C. Zamfir, Preventing shilling attacks in online recommender systems, in WIDM ’05: Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management, New York, NY (ACM, New York, 2005), pp. 67–74
H. Cramer, V. Evers, S. Ramlal, M. Someren, L. Rutledge, N. Stash, L. Aroyo, B. Wielinga, The effects of transparency on trust in and acceptance of a content-based art recommender. User Model. User-Adapted Interact. 18(5), 455–496 (2008)
P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n recommendation tasks, in Proceedings of the Fourth ACM Conference on Recommender Systems (2010), pp. 39–46
M.F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in Proceedings of the 13th ACM Conference on Recommender Systems (2019), pp. 101–109
A.S. Das, M. Datar, A. Garg, S. Rajaram, Google news personalization: scalable online collaborative filtering, in WWW ’07: Proceedings of the 16th International Conference on World Wide Web, New York, NY (ACM, New York, 2007), pp. 271–280
O. Dekel, C.D. Manning, Y. Singer, Log-linear models for label ranking, in NIPS’03 (2003), pp. 1–1
J. Demšar, Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
M. Deshpande, G. Karypis, Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst. 22(1), 143–177 (2004)
G. Fischer, User modeling in human-computer interaction. User Model. User-Adapt. Interact. 11(1–2), 65–86 (2001)
D.M. Fleder, K. Hosanagar, Recommender systems and their impact on sales diversity, in EC ’07: Proceedings of the 8th ACM Conference on Electronic Commerce, New York, NY, 2007 (ACM, New York, 2007), pp. 192–199
D. Frankowski, D. Cosley, S. Sen, L. Terveen, J. Riedl, You are what you say: privacy risks of public mentions, in SIGIR ’06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, 2006 (ACM, New York, 2006), pp. 565–572
G.A. Fredricks, R.B. Nelsen, On the relationship between Spearman’s rho and Kendall’s tau for pairs of continuous random variables. J. Stat. Plan. Infer. 137(7), 2143–2150 (2007)
S. Frumerman, G. Shani, B. Shapira, O. Sar Shalom, Are all rejected recommendations equally bad? towards analysing rejected recommendations, in Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization (2019), pp. 157–165
Z. Gantner, S. Rendle, C. Freudenthaler, L. Schmidt-Thieme, MyMediaLite: a free recommender system library, in Proceedings of the Fifth ACM Conference on Recommender Systems (2011), pp. 305–308
F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, A. Huber, Offline and online evaluation of news recommender systems at swissinfo, in Proceedings of the 8th ACM Conference on Recommender systems (2014), pp. 169–176
T. George, A scalable collaborative filtering framework based on co-clustering, in Fifth IEEE International Conference on Data Mining (2005), pp. 625–628
A.G. Greenwald, Within-subjects designs: To use or not to use? Psychol. Bull. 83, 216–229 (1976)
G. Guo, J. Zhang, Z. Sun, N. Yorke-Smith, LibRec: a Java library for recommender systems, in UMAP Workshops, vol. 4. Citeseer, 2015
P. Haddawy, V. Ha, A. Restificar, B. Geisler, J. Miyamoto, Preference elicitation via theory refinement. J. Mach. Learn. Res. 4, 317–337 (2003)
C. Hayes, P. Cunningham, An on-line evaluation framework for recommender systems. Technical report, Trinity College Dublin, Department of Computer Science, 2002
X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al., Practical lessons from predicting clicks on ads at Facebook, in Proceedings of the Eighth International Workshop on Data Mining for Online Advertising (2014), pp. 1–9
J.L. Herlocker, J.A. Konstan, J.T. Riedl, Explaining collaborative filtering recommendations, in CSCW ’00: Proceedings of the 2000 ACM conference on Computer Supported Cooperative Work, New York, NY (ACM, New York, 2000), pp. 241–250
J.L. Herlocker, J.A. Konstan, J.T. Riedl, An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Inf. Retr. 5(4), 287–310 (2002). ISSN:1386-4564. http://dx.doi.org/10.1023/A:1020443909834
J.L. Herlocker, J.A. Konstan, L.G. Terveen, J.T. Riedl, Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5–53 (2004). ISSN:1046-8188. http://doi.acm.org/10.1145/963770.963772
Y. Hijikata, T. Shimizu, S. Nishida, Discovery-oriented collaborative filtering for improving user satisfaction, in IUI ’09: Proceedings of the 13th International Conference on Intelligent User Interfaces, New York, NY (ACM, New York, 2009), pp. 67–76
R. Hu, P. Pu, A comparative user study on rating vs. personality quiz based preference elicitation methods, in IUI ’09: Proceedings of the 13th International Conference on Intelligent User Interfaces, New York, NY (ACM, New York, 2009), pp. 367–372
R. Hu, P. Pu, A comparative user study on rating vs. personality quiz based preference elicitation methods, in IUI (2009), pp. 367–372
R. Hu, P. Pu, A study on user perception of personality-based recommender systems, in UMAP (2010), pp. 291–302
N. Hug, Surprise: a Python library for recommender systems. J. Open Source Softw. 5(52), 2174 (2020)
A. Iovine, F. Narducci, G. Semeraro, Conversational recommender systems and natural language: a study through the converse framework. Decis. Support Syst. 131, 113250 (2020)
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002). ISSN:1046-8188. http://doi.acm.org/10.1145/582415.582418
N. Jones, P. Pu, User technology adoption issues in recommender systems, in Networking and Electronic Conference, 2007
M. Jugovac, D. Jannach, M. Karimi, StreamingRec: a framework for benchmarking stream-based news recommenders, in Proceedings of the 12th ACM Conference on Recommender Systems (2018), pp. 269–273
S. Jung, J.L. Herlocker, J. Webster, Click data as implicit relevance feedback in web search. Inf. Process. Manage. 43(3), 791–807 (2007)
G. Karypis, Evaluation of item-based top-n recommendation algorithms, in CIKM ’01: Proceedings of the Tenth International Conference on Information and Knowledge Management, New York, NY (ACM, New York, 2001), pp. 247–254
M.G. Kendall, A new measure of rank correlation. Biometrika 30(1–2), 81–93 (1938)
M.G. Kendall, The treatment of ties in ranking problems. Biometrika 33(3), 239–251 (1945)
R. Kohavi, R. Longbotham, D. Sommerfield, R.M. Henne, Controlled experiments on the web: survey and practical guide. Data Min. Knowl. Discov. 18(1), 140–181 (2009)
R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, N. Pohlmann, Online controlled experiments at large scale, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, New York, NY, 2013 (ACM, New York, 2013), pp. 1168–1176
J.A. Konstan, S.M. McNee, C.-N. Ziegler, R. Torres, N. Kapoor, J. Riedl, Lessons on applying automated recommender systems to information-seeking tasks, in AAAI, 2006
Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)
I. Koychev, I. Schwab, Adaptation to drifting user’s interests, in Proceedings of ECML2000 Workshop: Machine Learning in New Information Age (2000), pp. 39–46
S.K. Lam, J. Riedl, Shilling recommender systems for fun and profit, in WWW ’04: Proceedings of the 13th International Conference on World Wide Web, New York, NY (ACM, New York, 2004), pp. 393–402
S.K. Lam, D. Frankowski, J. Riedl, Do you trust your recommendations? An exploration of security and privacy issues in recommender systems, in Proceedings of the 2006 International Conference on Emerging Trends in Information and Communication Security (ETRICS), 2006
E.L. Lehmann, J.P. Romano, Testing Statistical Hypotheses, 3rd edn. Springer Texts in Statistics (Springer, New York, 2005)
R. Lempel, Personalization is a two-way street, in Proceedings of the Eleventh ACM Conference on Recommender Systems (2017), pp. 3–3
T. Mahmood, F. Ricci, Learning and adaptivity in interactive recommender systems, in ICEC ’07: Proceedings of the Ninth International Conference on Electronic Commerce, New York, NY (ACM, New York, 2007), pp. 75–84
A. Maksai, F. Garcin, B. Faltings, Predicting online performance of news recommender systems through richer evaluation metrics, in Proceedings of the 9th ACM Conference on Recommender Systems (2015), pp. 179–186
B.M. Marlin, R.S. Zemel, Collaborative prediction and ranking with non-random missing data, in Proceedings of the 2009 ACM Conference on Recommender Systems, RecSys 2009, New York, NY, October 23–25, 2009, pp. 5–12
P. Massa, B. Bhattacharjee, Using trust in recommender systems: an experimental analysis, in Proceedings of iTrust2004 International Conference (2004), pp. 221–235
M.R. McLaughlin, J.L. Herlocker, A collaborative filtering algorithm and evaluation metric that accurately model the user experience, in SIGIR ’04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY (ACM, New York, 2004), pp. 329–336
H.B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al., Ad click prediction: a view from the trenches, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013), pp. 1222–1230
S.M. McNee, J. Riedl, J.A. Konstan, Making recommendations better: an analytic model for human-recommender interaction, in CHI ’06: CHI ’06 Extended Abstracts on Human Factors in Computing Systems, New York, NY, 2006 (ACM, New York, 2006), pp. 1103–1108
F. McSherry, I. Mironov, Differentially private recommender systems: building privacy into the Netflix Prize contenders, in KDD ’09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY (ACM, New York, 2009), pp. 627–636
B. Mobasher, R. Burke, R. Bhaumik, C. Williams, Toward trustworthy recommender systems: an analysis of attack models and algorithm robustness. ACM Trans. Internet Technol. 7(4), 23 (2007)
T. Murakami, K. Mori, R. Orihara, Metrics for evaluating the serendipity of recommendation lists. New Front. Artif. Intell. 4914, 40–46 (2008)
T.T. Nguyen, D. Kluver, T.-Y. Wang, P.-M. Hui, M.D. Ekstrand, M.C. Willemsen, J. Riedl, Rating support interfaces to improve user experience and recommender accuracy, in Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, New York, NY (ACM, New York, 2013), pp. 149–156
M. O’Mahony, N. Hurley, N. Kushmerick, G. Silvestre, Collaborative recommendation: a robustness analysis. ACM Trans. Internet Technol. 4(4), 344–377 (2004)
S.L. Pfleeger, B.A. Kitchenham, Principles of survey research. SIGSOFT Softw. Eng. Notes 26(6), 16–18 (2001)
P. Pu, L. Chen, Trust building with explanation interfaces, in IUI ’06: Proceedings of the 11th International Conference on Intelligent User Interfaces, New York, NY, 2006 (ACM, New York, 2006), pp. 93–100
P. Pu, L. Chen, R. Hu, A user-centric evaluation framework for recommender systems, in Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys ’11, New York, NY (ACM, New York, 2011), pp. 157–164
P. Pu, L. Chen, R. Hu, A user-centric evaluation framework for recommender systems, in Proceedings of the Fifth ACM Conference on Recommender Systems (2011), pp. 157–164
S. Queiroz, Adaptive preference elicitation for top-k recommendation tasks using GAI-networks, in AIAP’07: Proceedings of the 25th IASTED International Multi-Conference, Anaheim, CA, 2007 (ACTA Press, Calgary, 2007), pp. 579–584
S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in UAI ’09: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009
F. Ricci, Recommender systems in tourism, in Handbook of e-Tourism (Springer, Cham, 2020), pp. 1–18
M. Rossetti, F. Stella, M. Zanker, Contrasting offline and online results when evaluating recommendation algorithms, in Proceedings of the 10th ACM Conference on Recommender Systems (2016), pp. 31–34
M.L. Russell, D.G. Moralejo, E.D. Burgess, Paying research subjects: participants’ perspectives. J. Med. Ethics 26(2), 126–130 (2000)
A. Said, A short history of the recsys challenge. AI Mag. 37(4), 102–104 (2017)
A. Said, A. Bellogín, Comparative recommender system evaluation: benchmarking recommendation frameworks, in Proceedings of the 8th ACM Conference on Recommender Systems (2014), pp. 129–136
A. Said, A. Bellogín, Rival: a toolkit to foster reproducibility in recommender system evaluation, in Proceedings of the 8th ACM Conference on Recommender systems (2014), pp. 371–372
S.L. Salzberg, On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min. Knowl. Discov. 1(3), 317–328 (1997)
M.R. Santana, L.C. Melo, F.H.F. Camargo, B. Brandão, A. Soares, R.M. Oliveira, S. Caetano, Mars-gym: a gym framework to model, train, and evaluate recommender systems for marketplaces (2020). Preprint. arXiv:2010.07035
B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Analysis of recommendation algorithms for e-commerce, in EC ’00: Proceedings of the 2nd ACM Conference on Electronic Commerce, New York, NY (ACM, New York, 2000), pp. 158–167
B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in WWW ’01: Proceedings of the 10th International Conference on World Wide Web, New York, NY (ACM, New York, 2001), pp. 285–295
B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in Proceedings of the 10th International Conference on World Wide Web (2001), pp. 285–295
A.I. Schein, A. Popescul, L.H. Ungar, D.M. Pennock, Methods and metrics for cold-start recommendations, in SIGIR ’02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, 2002 (ACM, New York, 2002), pp. 253–260
S. Sedhain, A.K. Menon, S. Sanner, L. Xie, Autorec: autoencoders meet collaborative filtering, in Proceedings of the 24th International Conference on World Wide Web (2015), pp. 111–112
G. Shani, D. Heckerman, R.I. Brafman, An MDP-based recommender system. J. Mach. Learn. Res. 6, 1265–1295 (2005)
G. Shani, D.M. Chickering, C. Meek, Mining recommendations from the web, in RecSys ’08: Proceedings of the 2008 ACM Conference on Recommender Systems (2008), pp. 35–42
G. Shani, L. Rokach, B. Shapira, S. Hadash, M. Tangi, Investigating confidence displays for top-n recommendations. JASIST 64(12), 2548–2563 (2013)
N. Silberstein, O. Somekh, Y. Koren, M. Aharon, D. Porat, A. Shahar, T. Wu, Ad close mitigation for improved user experience in native advertisements, in Proceedings of the 13th International Conference on Web Search and Data Mining (2020), pp. 546–554
B. Smyth, P. McClave, Similarity vs. diversity, in ICCBR (2001), pp. 347–361
W.J. Spillman, E. Lang, The Law of Diminishing Returns (World Book Company, New York, 1924)
H. Steck, Item popularity and recommendation accuracy, in Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys ’11, New York, NY, 2011 (ACM, New York, 2011), pp. 125–132
H. Steck, Evaluation of recommendations: rating-prediction and ranking, in Seventh ACM Conference on Recommender Systems, RecSys ’13, Hong Kong, China, October 12–16, 2013, pp. 213–220
K. Swearingen, R. Sinha, Beyond algorithms: An HCI perspective on recommender systems, in ACM SIGIR 2001 Workshop on Recommender Systems, 2001
C.J. Van Rijsbergen, Information Retrieval (Butterworth-Heinemann, Newton, MA, 1979)
E.M. Voorhees, The philosophy of information retrieval evaluation, in CLEF ’01: Revised Papers from the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems (Springer, London, 2002), pp. 355–370
E.M. Voorhees, Overview of TREC 2002, in Proceedings of the 11th Text Retrieval Conference (TREC 2002), NIST Special Publication 500-251 (2002), pp. 1–15
Y.Y. Yao, Measuring retrieval effectiveness based on user preference of documents. J. Am. Soc. Inf. Syst. 46(2), 133–145 (1995)
E. Yilmaz, J.A. Aslam, S. Robertson, A new rank correlation coefficient for information retrieval, in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, New York, NY, 2008 (ACM, New York, 2008), pp. 587–594
Y. Zeldes, S. Theodorakis, E. Solodnik, A. Rotman, G. Chamiel, D. Friedman, Deep density networks and uncertainty in recommender systems (2017). Preprint. arXiv:1711.02487
M. Zhang, N. Hurley, Avoiding monotony: improving the diversity of recommendation lists, in RecSys ’08: Proceedings of the 2008 ACM Conference on Recommender Systems (ACM, New York, NY, 2008), pp. 123–130
S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning based recommender system: a survey and new perspectives. ACM Comput. Surv. (CSUR) 52(1), 1–38 (2019)
Y. Zhang, J. Callan, T. Minka, Novelty and redundancy detection in adaptive filtering, in SIGIR ’02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM, New York, NY, 2002), pp. 81–88
C.-N. Ziegler, S.M. McNee, J.A. Konstan, G. Lausen, Improving recommendation lists through topic diversification, in WWW ’05: Proceedings of the 14th International Conference on World Wide Web (ACM, New York, 2005), pp. 22–32
Copyright information
© 2022 Springer Science+Business Media, LLC, part of Springer Nature
About this chapter
Cite this chapter
Gunawardana, A., Shani, G., Yogev, S. (2022). Evaluating Recommender Systems. In: Ricci, F., Rokach, L., Shapira, B. (eds) Recommender Systems Handbook. Springer, New York, NY. https://doi.org/10.1007/978-1-0716-2197-4_15
DOI: https://doi.org/10.1007/978-1-0716-2197-4_15
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-0716-2196-7
Online ISBN: 978-1-0716-2197-4
eBook Packages: Computer Science, Computer Science (R0)