
Learning to recommend similar items from human judgments

  • Published in: User Modeling and User-Adapted Interaction

Abstract

Similar item recommendations—a common feature of many Web sites—point users to other interesting objects given a currently inspected item. A common way of computing such recommendations is to use a similarity function, which expresses how much alike two given objects are. Such similarity functions are usually designed based on the specifics of the given application domain. In this work, we explore how such functions can be learned from human judgments of similarities between objects, using two domains of “quality and taste”—cooking recipe and movie recommendation—as guiding scenarios. In our approach, we first collect a few thousand pairwise similarity assessments with the help of crowdworkers. Using these data, we then train different machine learning models that can be used as similarity functions to compare objects. Offline analyses reveal for both application domains that models that combine different types of item characteristics are the best predictors for human-perceived similarity. To further validate the usefulness of the learned models, we conducted additional user studies. In these studies, we exposed participants to similar item recommendations using a set of models that were trained with different feature subsets. The results showed that the combined models that exhibited the best offline prediction performance not only led to the highest user-perceived similarity but also to recommendations that were considered useful by the participants, thus confirming the feasibility of our approach.

Notes

  1. Earlier work discussing the concept of similarity between objects from a psychological perspective can be found, for example, in Tversky and Gati (1978). In their work, the authors argue that human judgment of similarity is not only feature based, as is assumed in our work. We agree with this view and see the exploration of this topic as a promising area for future work.

  2. Stability and reliability aspects of human judgments in the music domain are also discussed in Jones et al. (2007).

  3. Note that on allrecipes.com the provided descriptions, e.g., ingredient lists, are peer-reviewed and standardized by community editors. This is particularly the case for recipes published under the main dish category, which we consider in this study. Applying our methods to other recipe datasets would require a preprocessing step to standardize the ingredients in the corpus (see, for example, Trattner et al. (2019)).

  4. Released August 2018: https://grouplens.org/datasets/movielens/latest/.

  5. https://www.themoviedb.org/.

  6. https://developers.themoviedb.org/3.

  7. Details about the exact computation of the measures are provided in Table 10 in the Appendix.

  8. LDA was also successfully used for recipe titles in Kusmierczyk and Nørvåg (2016) and Rokicki et al. (2018).

  9. Perplexity was used as the criterion to tune the model parameters. We ran experiments from 10 to 1000 topics for all LDA models. In the end, we decided to use the models with 100 topics, which gave us close-to-optimal performance while keeping the number of features and the computational costs low.
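
The perplexity-based tuning described in this note can be sketched as follows. This is an illustration only: the corpus is invented, scikit-learn is our choice of library (not necessarily the one used in the study), and a real run would scan 10 to 1000 topics on held-out text.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny invented corpus standing in for recipe directions or movie plots.
docs = [
    "chicken garlic pasta parmesan basil",
    "beef stew carrot potato onion thyme",
    "chocolate cake sugar butter flour vanilla",
    "garlic chicken rice soy ginger scallion",
]
X = CountVectorizer().fit_transform(docs)

# Scan candidate topic counts; keep the model with the lowest perplexity.
# (The study scanned 10-1000 topics and settled on 100.)
best_k, best_pp = None, float("inf")
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    pp = lda.perplexity(X)  # lower perplexity = better fit
    if pp < best_pp:
        best_k, best_pp = k, pp
```

In practice the winning topic count would be re-fit on the full corpus and its document-topic vectors used as features for the similarity model.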

  10. http://www.openimaj.org/.

  11. The parameter was estimated in a user study by Hasler and Suesstrunk (2003) and is considered optimal. In their work, the authors obtained a correlation of more than 95% with human judgments using this formula and parameter.
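
The colorfulness measure from Hasler and Suesstrunk (2003) is compact enough to reproduce; the NumPy sketch below follows their opponent-channel formulation, with 0.3 being the tuned parameter this note refers to (the test images are invented):

```python
import numpy as np

def colorfulness(img):
    """Hasler & Suesstrunk (2003) colorfulness metric.

    img: H x W x 3 array of RGB values.
    Returns sigma_rgyb + 0.3 * mu_rgyb, where 0.3 is the
    empirically estimated weight mentioned in the footnote.
    """
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    rg = r - g              # red-green opponent channel
    yb = 0.5 * (r + g) - b  # yellow-blue opponent channel
    sigma = np.hypot(rg.std(), yb.std())   # combined std. deviation
    mu = np.hypot(rg.mean(), yb.mean())    # combined mean magnitude
    return sigma + 0.3 * mu

# A flat gray image is maximally un-colorful; a saturated red one is not.
gray = np.full((8, 8, 3), 128)
red = np.zeros((8, 8, 3))
red[..., 0] = 255
```

Here `colorfulness(gray)` is exactly 0, while `colorfulness(red)` is large, matching the intuition the metric encodes.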

  12. We plan to explore the use of alternative architectures in the future, such as ResNet (He et al. 2016) and Inception (Szegedy et al. 2016).

  13. https://keras.io/.

  14. This procedure is similar to the one used in Yao and Harper (2018). Alternative approaches for collecting similarity judgments are possible, e.g., by using a third item as a reference for the participants. Such designs might, however, lead to an increased complexity of the judgment task.

  15. We chose main dishes because they are one of the most popular categories on the platform and we did not want our study to be confined to a smaller subset of recipe types. Second, main dishes can be quite varied, which makes the similar item retrieval task more challenging than, for example, for desserts. Finally, one of our goals was to be consistent with previous works that also used main dishes as a basis for their experiments, e.g., Howard et al. (2012) and Trattner et al. (2018).

  16. The reason for this procedure is to ensure that we obtain a larger number of judgments for a diverse set of items. This in turn allows us to train more reliable models with a constrained budget. Having more judges per pair is possible, but it would require significantly more study participants to make sure that many dishes or movies are covered by the judgments.

  17. HIT stands for Human Intelligence Task on Amazon Mechanical Turk.

  18. The homogeneity of variances for all ANOVA tests was checked with Levene’s test.
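
The homogeneity check mentioned here is a one-liner with SciPy; the three response samples below are simulated stand-ins for per-condition study responses, not the study's data:

```python
import numpy as np
from scipy.stats import levene

# Simulated rating-style responses for three experimental conditions.
rng = np.random.default_rng(0)
group_a = rng.normal(3.5, 1.0, size=50)
group_b = rng.normal(3.8, 1.0, size=50)
group_c = rng.normal(3.2, 1.0, size=50)

# Levene's test: H0 = all groups have equal variances.
# A non-significant p-value supports the ANOVA homogeneity assumption.
stat, p = levene(group_a, group_b, group_c)
```
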

  19. We chose Spearman as the correlation metric because the data (user ratings) are (a) not normally distributed and (b) on an ordinal scale.
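
With SciPy this choice reads as follows (the ratings are invented for illustration). Spearman correlates ranks rather than raw values, so it requires neither normality nor an interval scale, which is exactly why it fits conditions (a) and (b):

```python
from scipy.stats import spearmanr

# Invented ordinal ratings (1-5) for the same item pairs:
# one vector from a similarity metric, one from human judges.
metric_scores = [1, 2, 2, 3, 4, 5, 5, 4]
human_ratings = [1, 1, 2, 3, 3, 5, 4, 4]

# spearmanr ranks both vectors (ties get average ranks) and then
# computes the Pearson correlation of the ranks.
rho, p_value = spearmanr(metric_scores, human_ratings)
```
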

  20. Image embeddings have been shown to be useful in many application areas of multimedia. Recently, they have been used not only to classify images but also, in the context of recommender systems, to recommend images to people (see, e.g., Messina et al. (2018)). Compared to explicit feature-based approaches, as also used in this paper, embeddings can capture several aspects of an image at the same time, such as shapes and colors.

  21. Similar discrepancies were previously analyzed in the field of psychology, e.g., in Einhorn et al. (1979).

  22. Compared to a standard Ordinary Least Squares model, Lasso and Ridge regression introduce regularization terms (penalties) into their models (Tibshirani 1996). The aim of Ridge regression is to “minimize the sum of squared residuals but also penalize the size of parameter estimates, in order to shrink them towards zero” (Oleszak 2018). This penalty is called the L2 penalty. Lasso, in contrast, is based on an L1 penalty; for further details, see Oleszak (2018). An alternative would be to use explicit feature selection, as done in O’Mahony et al. (2009).
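
The contrast between the two penalties can be illustrated with scikit-learn (our substitution; the study used R's caret). The data are synthetic, with only the first two of twenty features informative:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic similarity-feature matrix: 100 samples, 20 features,
# only features 0 and 1 actually drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: zeroes some coefficients

n_zero = int((lasso.coef_ == 0).sum())  # count of dropped features
```

Ridge keeps every coefficient but shrinks the vector's norm toward zero; Lasso additionally sets irrelevant coefficients exactly to zero, acting as an implicit form of the explicit feature selection the footnote mentions as an alternative.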

  23. We used R’s caret package for that purpose. Further details on model training and parameter tuning can be found here: https://topepo.github.io/caret/model-training-and-tuning.html#basic-parameter-tuning.

  24. The attention check for the movie domain study was nearly identical to the one in the recipe study. Instead of displaying the attention check in the “directions” text, we displayed it in the “stars” section.

  25. We chose a list length of 5 items not only to keep the cognitive load for participants low but also because recipe sites often display no more than 5 recommendations without scrolling.

  26. The set \(R \backslash r_i\) does not contain recipes or movie pairs already used in Study 1a and Study 1b, respectively.

  27. The attention check was in the “description” section for the recipe recommender study and in the “star(s)” section for the movie study.

  28. Considering recommendations for reference recipes that the user does not like (e.g., because she is a vegetarian and the reference meal contains meat) will also lead to low response values for the recommendations, since these are assumed to be similar to the reference.

  29. We see this as another indicator of the reliability of the respondents.

References

  • Adomavicius, G., Kwon, Y.: Improving aggregate recommendation diversity using ranking-based techniques. IEEE Trans. Knowl. Data Eng. 24(5), 896–911 (2012)

  • Allison, L., Dix, T.I.: A bit-string longest-common-subsequence algorithm. Inf. Process. Lett. 23(5), 305–310 (1986)

  • Aucouturier, J.J., Pachet, F., et al.: Music similarity measures: what’s the use? In: Proceedings of ISMIR ’02 (2002)

  • Beel, J., Langer, S.: A comparison of offline evaluations, online evaluations, and user studies in the context of research-paper recommender systems. In: Proceedings of TPDL ’15 (2015)

  • Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  • Brovman, Y.M., Jacob, M., Srinivasan, N., Neola, S., Galron, D., Snyder, R., Wang, P.: Optimizing similar item recommendations in a semi-structured marketplace to maximize conversion. In: Proceedings of RecSys ’16 (2016)

  • Buhrmester, M., Kwang, T., Gosling, S.D.: Amazon’s mechanical Turk: a new source of inexpensive, yet high-quality, data? Perspect. Psychol. Sci. 6(1), 3–5 (2011)

  • Colucci, L., Doshi, P., Lee, K.L., Liang, J., Lin, Y., Vashishtha, I., Zhang, J., Jude, A.: Evaluating item–item similarity algorithms for movies. In: Proceedings of CHI EA ’16 (2016)

  • Cremonesi, P., Garzotto, F., Turrin, R.: Investigating the persuasion potential of recommender systems from a quality perspective: an empirical study. ACM Trans. Intell. Syst. Technol. (2012). https://doi.org/10.1145/2209310.2209314

  • Deldjoo, Y., Elahi, M., Cremonesi, P., Garzotto, F., Piazzolla, P., Quadrana, M.: Content-based video recommendation system based on stylistic visual features. J. Data Semant. 5(2), 1–15 (2016)

  • eBizMBA: eBizMBA Rankings for Recipe Websites (2017). http://www.ebizmba.com/articles/recipe-websites. Accessed 19 April 2017

  • Eksombatchai, C., Jindal, P., Liu, J.Z., Liu, Y., Sharma, R., Sugnet, C., Ulrich, M., Leskovec, J.: Pixie: a system for recommending 3+ billion items to 200+ million users in real-time. In: Proceedings of the Web Conference ’18 (2018)

  • Ellis, D.P.W., Whitman, B., Berenzweig, A., Lawrence, S.: The quest for ground truth in musical artist similarity. In: Proceedings of ISMIR ’02 (2002)

  • Elsweiler, D., Trattner, C., Harvey, M.: Exploiting food choice biases for healthier recipe recommendation. In: Proceedings of SIGIR ’17 (2017)

  • Freyne, J., Berkovsky, S.: Intelligent food planning: personalized recipe recommendation. In: Proceedings of IUI ’10 (2010)

  • Garcin, F., Faltings, B., Donatsch, O., Alazzawi, A., Bruttin, C., Huber, A.: Offline and online evaluation of news recommender systems at swissinfo.ch. In: Proceedings of RecSys ’14 (2014)

  • Gedikli, F., Jannach, D.: Improving recommendation accuracy based on item-specific tag preferences. ACM Trans. Intell. Syst. Technol. 4(1), 43–55 (2013)

  • Gedikli, F., Jannach, D., Ge, M.: How should I explain? A comparison of different explanation types for recommender systems. Int. J. Hum Comput Stud. 72(4), 367–382 (2014)

  • Golbeck, J., Hendler, J., et al.: Filmtrust: movie recommendations using trust in web-based social networks. In: Proceedings of CCNC ’06 (2006)

  • Harvey, M., Ludwig, B., Elsweiler, D.: You are what you eat: learning user tastes for rating prediction. In: Proceedings of SPIRE ’13 (2013)

  • Hasler, D., Suesstrunk, S.E.: Measuring colorfulness in natural images. In: Human vision and electronic imaging VIII, vol. 5007, pp. 87–96. International Society for Optics and Photonics (2003)

  • Hauser, D.J., Schwarz, N.: Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behav. Res. Methods 48(1), 400–407 (2016)

  • He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR ’16, pp. 770–778 (2016)

  • Howard, S., Adams, J., White, M., et al.: Nutritional content of supermarket ready meals and recipes by television chefs in the United Kingdom: cross sectional study. BMJ 345, e7607 (2012)

  • Einhorn, H.J., Kleinmuntz, D.N., Kleinmuntz, B.: Linear regression and process-tracing models of judgment. Psychol. Rev. 86, 465–485 (1979)

  • Jannach, D., Adomavicius, G.: Recommendations with a purpose. In: Proceedings of RecSys ’16 (2016)

  • Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)

  • Jones, M.C., Downie, J.S., Ehmann, A.F.: Human similarity judgments: implications for the design of formal evaluations. In: Proceedings of ISMIR ’07 (2007)

  • Kim, S.D., Lee, Y.J., Cho, H.G., Yoon, S.M.: Complexity and similarity of recipes based on entropy measurement. Indian J. Sci. Technol. (2016). https://doi.org/10.17485/ijst/2016/v9i26/97324

  • Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the user experience of recommender systems. User Model. User Adapt. Interact. 22(4), 441–504 (2012)

  • Kondrak, G.: N-gram similarity and distance. In: Proceedings of SPIRE ’05, pp. 115–126. Springer (2005)

  • Kusmierczyk, T., Nørvåg, K.: Online food recipe title semantics: combining nutrient facts and topics. In: Proceedings of CIKM ’16 (2016)

  • Lee, J.H.: Crowdsourcing music similarity judgments using mechanical Turk. In: Proceedings of ISMIR ’10 (2010)

  • Lops, P., De Gemmis, M., Semeraro, G.: Content-based recommender systems: state of the art and trends. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook. Springer, New York (2011)

  • Maksai, A., Garcin, F., Faltings, B.: Predicting online performance of news recommender systems through richer evaluation metrics. In: Proceedings of RecSys ’15 (2015)

  • Messina, P., Dominguez, V., Parra, D., Trattner, C., Soto, A.: Content-based artwork recommendation: integrating painting metadata with neural and manually-engineered visual features. User Model. User Adapt. Interact. 28, 40 (2018)

  • Milosavljevic, M., Navalpakkam, V., Koch, C., Rangel, A.: Relative visual saliency differences induce sizable bias in consumer choice. J. Consum. Psychol. 22(1), 67–74 (2012)

  • Mirizzi, R., Di Noia, T., Ragone, A., Ostuni, V.C., Di Sciascio, E.: Movie recommendation with DBpedia. In: Proceedings of IIR ’12 (2012)

  • Oleszak, M.: Regularization: Ridge, lasso and elastic net (2018). https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net. Accessed June 2019

  • O’Mahony, M.P., Smyth, B.: Learning to recommend helpful hotel reviews. In: Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pp. 305–308 (2009)

  • Ostuni, V.C., Di Noia, T., Di Sciascio, E., Mirizzi, R.: Top-n recommendations from implicit feedback leveraging linked open data. In: Proceedings of RecSys ’13 (2013)

  • Peer, E., Vosgerau, J., Acquisti, A.: Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav. Res. Methods 46(4), 1023–1031 (2014)

  • Pu, P., Chen, L., Hu, R.: A user-centric evaluation framework for recommender systems. In: Proceedings of RecSys ’11 (2011)

  • Rokicki, M., Trattner, C., Herder, E.: The impact of recipe features, social cues and demographics on estimating the healthiness of online recipes. In: Proceedings of ICWSM ’18 (2018)

  • Rossetti, M., Stella, F., Zanker, M.: Contrasting offline and online results when evaluating recommendation algorithms. In: Proceedings of RecSys ’16 (2016)

  • Rublee, E., Rabaud, V., Konolige, K., Bradski, G.R.: ORB: an efficient alternative to SIFT or SURF. In: Proceedings of ICCV ’11 (2011)

  • San Pedro, J., Siersdorfer, S.: Ranking and classifying attractiveness of photos in folksonomies. In: Proceedings of WWW ’09 (2009)

  • Sen, S., Vig, J., Riedl, J.: Tagommenders: connecting users to items through tags. In: Proceedings of WWW ’09 (2009)

  • Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)

  • Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv preprint arXiv:1409.1556

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of CVPR ’16, pp. 2818–2826 (2016)

  • Teng, C.Y., Lin, Y.R., Adamic, L.A.: Recipe recommendation using ingredient networks. In: Proceedings of WebSci ’12 (2012)

  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58(1), 267–288 (1996)

  • Tran, T.N.T., Atas, M., Felfernig, A., Stettinger, M.: An overview of recommender systems in the healthy food domain. J. Intell. Inf. Syst. 50, 501–526 (2017)

  • Trattner, C., Elsweiler, D.: Food recommender systems: important contributions, challenges and future research directions (2017a). arXiv preprint arXiv:1711.02760

  • Trattner, C., Elsweiler, D.: Investigating the healthiness of internet-sourced recipes: implications for meal planning and recommender systems. In: Proceedings of WWW ’17, pp. 489–498 (2017b)

  • Trattner, C., Moesslang, D., Elsweiler, D.: On the predictability of the popularity of online recipes. EPJ Data Sci. (2018). https://doi.org/10.1140/epjds/s13688-018-0149-5

  • Trattner, C., Kusmierczyk, T., Nørvåg, K.: Investigating and predicting online food recipe upload behavior. Inf. Process. Manag. 56(3), 654–673 (2019)

  • Tversky, A., Gati, I.: Studies of similarity. Cognit. Categ. 1(1978), 79–98 (1978)

  • van Pinxteren, Y., Geleijnse, G., Kamsteeg, P.: Deriving a recipe similarity measure for recommending healthful meals. In: Proceedings of IUI ’11 (2011)

  • Vargas, S., Castells, P.: Rank and relevance in novelty and diversity metrics for recommender systems. In: Proceedings of RecSys ’11 (2011)

  • Vig, J., Sen, S., Riedl, J.: Tagsplanations: explaining recommendations using tags. In: Proceedings of IUI ’09, pp. 47–56 (2009)

  • Wang, L., Li, Q., Li, N., Dong, G., Yang, Y.: Substructure similarity measurement in Chinese recipes. In: Proceedings of WWW ’08 (2008)

  • Wang, C., Agrawal, A., Li, X., Makkad, T., Veljee, E., Mengshoel, O., Jude, A.: Content-based top-n recommendations with perceived similarity. In: Proceedings of SMC ’17 (2017)

  • Yang, L., Hsieh, C.K., Yang, H., Pollak, J.P., Dell, N., Belongie, S., Cole, C., Estrin, D.: Yum-me: a personalized nutrient-based meal recommender system. ACM Trans. Inf. Syst. 36(1), 7 (2017)

  • Yao, Y., Harper, F.M.: Judging similarity: a user-centric study of related item recommendations. In: Proceedings of RecSys ’18 (2018)

  • Yujian, L., Bo, L.: A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1091–1095 (2007)

  • Zhong, Y., Menezes, T.L.S., Kumar, V., Zhao, Q., Harper, F.M.: A field study of related video recommendations: newest, most similar, or most relevant? In: Proceedings of RecSys ’18 (2018)

  • Ziegler, C.N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: Proceedings of WWW ’05 (2005)

Author information

Corresponding author

Correspondence to Christoph Trattner.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 10, 11, 12, 13, 14, 15, 16 and Figs. 11, 12, 13, 14, 15, and 16.

Table 10 Similarity metrics computed based on movie titles, images, plots, genres, director(s), release dates and stars
Table 11 Similarity metric correlation (Spearman) with user similarity estimates per cues when metrics are linearly combined (movie domain) using equal weights in the linear model
Table 12 Results when considering additional features (movie domain)
Table 13 Results when considering only one information cue at the time (movie domain)
Table 14 Survey questions for the recipe domain
Table 15 Survey questions for the movie domain
Table 16 Recipe and movie dataset content feature statistics
Fig. 11

Crowdworker characteristics (who passed the attention check) of the similarity assessment study (movie domain)

Fig. 12

Feature importance for the best performing Ridge regression model (movie domain)

Fig. 13

Study 1b: Web interface to collect similarity judgments for movies. Regarding the choice of features to be shown, note that it is not uncommon in practice to show more than just the title, image and a short description. iTunes, for example, shows the genre; IMDb also shows the plot, directors and star ratings

Fig. 14

Screen capture of Study 2b (movie domain)

Fig. 15

Study 2b: a helpfulness, b diversity, c surprisingness and d excitingness of the recommended lists (means and std. errors). Scale: 1 (not at all)–5 (totally agree)

Fig. 16

Study 2b intention to use the recommendation method in the future (means and std. errors). Scale: 1 (not at all)–5 (totally agree)

About this article

Cite this article

Trattner, C., Jannach, D. Learning to recommend similar items from human judgments. User Model User-Adap Inter 30, 1–49 (2020). https://doi.org/10.1007/s11257-019-09245-4
