
Learning to recommend similar items from human judgments

  • Published in: User Modeling and User-Adapted Interaction

Abstract

Similar item recommendations—a common feature of many Web sites—point users to other interesting objects given a currently inspected item. A common way of computing such recommendations is to use a similarity function, which expresses how much alike two given objects are. Such similarity functions are usually designed based on the specifics of the given application domain. In this work, we explore how such functions can be learned from human judgments of similarities between objects, using two domains of “quality and taste”—cooking recipe and movie recommendation—as guiding scenarios. In our approach, we first collect a few thousand pairwise similarity assessments with the help of crowdworkers. Using these data, we then train different machine learning models that can be used as similarity functions to compare objects. Offline analyses reveal for both application domains that models that combine different types of item characteristics are the best predictors for human-perceived similarity. To further validate the usefulness of the learned models, we conducted additional user studies. In these studies, we exposed participants to similar item recommendations using a set of models that were trained with different feature subsets. The results showed that the combined models that exhibited the best offline prediction performance not only led to the highest user-perceived similarity but also to recommendations that were considered useful by the participants, thus confirming the feasibility of our approach.

Notes

  1. Earlier work discussing the concept of similarity between objects from a psychological perspective can be found, for example, in Tversky and Gati (1978). In their work, the authors argue that human judgment of similarity is not only feature based, as is assumed in our work. We agree with this view and see the exploration of this topic as a promising area for future work.

  2. Stability and reliability aspects of human judgments in the music domain are also discussed in Jones et al. (2007).

  3. Note that on allrecipes.com the provided descriptions, e.g., ingredient lists, are peer-reviewed and standardized by community editors. This is particularly the case for recipes published under the main dish category, which we consider in this study. Applying our methods to other recipe datasets would require a preprocessing step to standardize the ingredients in the corpus (see, for example, Trattner et al. (2019)).

  4. Released August 2018: https://grouplens.org/datasets/movielens/latest/.

  5. https://www.themoviedb.org/.

  6. https://developers.themoviedb.org/3.

  7. Details about the exact computation of the measures are provided in Table 10 in the Appendix.

  8. LDA was also successfully used for recipe titles in Kusmierczyk and Nørvåg (2016) and Rokicki et al. (2018).

  9. Perplexity was used as the criterion to tune the model parameters. We ran experiments from 10 to 1000 topics for all LDA models. In the end, we decided to use the models with 100 topics, which gave us close-to-optimal performance while keeping the number of features and the computational costs low.
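
The perplexity-based tuning described in this note can be sketched as follows. This is an illustration only: the corpus is invented, scikit-learn is our choice of library (not necessarily the one used in the study), and a real run would scan 10 to 1000 topics on held-out text.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny invented corpus standing in for recipe directions or movie plots.
docs = [
    "chicken garlic pasta parmesan basil",
    "beef stew carrot potato onion thyme",
    "chocolate cake sugar butter flour vanilla",
    "garlic chicken rice soy ginger scallion",
]
X = CountVectorizer().fit_transform(docs)

# Scan candidate topic counts; keep the model with the lowest perplexity.
# (The study scanned 10-1000 topics and settled on 100.)
best_k, best_pp = None, float("inf")
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    pp = lda.perplexity(X)  # lower perplexity = better fit
    if pp < best_pp:
        best_k, best_pp = k, pp
```

In practice the winning topic count would be re-fit on the full corpus and its document-topic vectors used as features for the similarity model.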

  10. http://www.openimaj.org/.

  11. The parameter was estimated in a user study by Hasler and Suesstrunk (2003) and is considered optimal. In their work, the authors obtained a correlation of more than 95% with human judgments using this formula and parameter.
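
The colorfulness measure from Hasler and Suesstrunk (2003) is compact enough to reproduce; the NumPy sketch below follows their opponent-channel formulation, with 0.3 being the tuned parameter this note refers to (the test images are invented):

```python
import numpy as np

def colorfulness(img):
    """Hasler & Suesstrunk (2003) colorfulness metric.

    img: H x W x 3 array of RGB values.
    Returns sigma_rgyb + 0.3 * mu_rgyb, where 0.3 is the
    empirically estimated weight mentioned in the footnote.
    """
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    rg = r - g              # red-green opponent channel
    yb = 0.5 * (r + g) - b  # yellow-blue opponent channel
    sigma = np.hypot(rg.std(), yb.std())   # combined std. deviation
    mu = np.hypot(rg.mean(), yb.mean())    # combined mean magnitude
    return sigma + 0.3 * mu

# A flat gray image is maximally un-colorful; a saturated red one is not.
gray = np.full((8, 8, 3), 128)
red = np.zeros((8, 8, 3))
red[..., 0] = 255
```

Here `colorfulness(gray)` is exactly 0, while `colorfulness(red)` is large, matching the intuition the metric encodes.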

  12. We plan to explore the use of alternative architectures in the future, such as ResNet (He et al. 2016) and Inception (Szegedy et al. 2016).

  13. https://keras.io/.

  14. This procedure is similar to the one used in Yao and Harper (2018). Alternative approaches for collecting similarity judgments are possible, e.g., by using a third item as a reference for the participants. Such designs might, however, lead to an increased complexity of the judgment task.

  15. We chose main dishes because they are one of the most popular categories on the platform and we did not want our study to be confined to a smaller subset of recipe types. Second, main dishes can be quite varied, which makes the similar item retrieval task more challenging than, for example, for desserts. Finally, one of our goals was to be consistent with previous works that also used main dishes as a basis for their experiments, e.g., Howard et al. (2012) and Trattner et al. (2018).

  16. The reason for this procedure is to ensure that we obtain a larger number of judgments for a diverse set of items. This in turn allows us to train more reliable models with a constrained budget. Having more judges per pair is possible, but it would require significantly more study participants to make sure that many dishes or movies are covered by the judgments.

  17. HIT stands for Human Intelligence Task on Amazon Mechanical Turk.

  18. The homogeneity of variances for all ANOVA tests was checked with Levene’s test.
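
The homogeneity check mentioned here is a one-liner with SciPy; the three response samples below are simulated stand-ins for per-condition study responses, not the study's data:

```python
import numpy as np
from scipy.stats import levene

# Simulated rating-style responses for three experimental conditions.
rng = np.random.default_rng(0)
group_a = rng.normal(3.5, 1.0, size=50)
group_b = rng.normal(3.8, 1.0, size=50)
group_c = rng.normal(3.2, 1.0, size=50)

# Levene's test: H0 = all groups have equal variances.
# A non-significant p-value supports the ANOVA homogeneity assumption.
stat, p = levene(group_a, group_b, group_c)
```
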

  19. We chose Spearman as the correlation metric because the data (user ratings) are (a) not normally distributed and (b) on an ordinal scale.
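
With SciPy this choice reads as follows (the ratings are invented for illustration). Spearman correlates ranks rather than raw values, so it requires neither normality nor an interval scale, which is exactly why it fits conditions (a) and (b):

```python
from scipy.stats import spearmanr

# Invented ordinal ratings (1-5) for the same item pairs:
# one vector from a similarity metric, one from human judges.
metric_scores = [1, 2, 2, 3, 4, 5, 5, 4]
human_ratings = [1, 1, 2, 3, 3, 5, 4, 4]

# spearmanr ranks both vectors (ties get average ranks) and then
# computes the Pearson correlation of the ranks.
rho, p_value = spearmanr(metric_scores, human_ratings)
```
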

  20. Image embeddings have been shown to be useful in many application areas of multimedia. Recently, they have been used not only to classify images but also, in the context of recommender systems, to recommend images to people (see, e.g., Messina et al. (2018)). Compared to explicit feature-based approaches, as also used in this paper, embeddings can capture several aspects of an image at the same time, such as shapes and colors.

  21. Similar discrepancies were previously analyzed in the field of psychology, e.g., in Einhorn et al. (1979).

  22. Compared to a standard Ordinary Least Squares model, Lasso and Ridge regression introduce regularization terms (penalties) into their models (Tibshirani 1996). The aim of Ridge regression is to “minimize the sum of squared residuals but also penalize the size of parameter estimates, in order to shrink them towards zero” (Oleszak 2018). This penalty is called the L2 penalty. Lasso, in contrast, is based on an L1 penalty; for further details, see Oleszak (2018). An alternative would be to use explicit feature selection, as done in O’Mahony et al. (2009).
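
The contrast between the two penalties can be illustrated with scikit-learn (our substitution; the study used R's caret). The data are synthetic, with only the first two of twenty features informative:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic similarity-feature matrix: 100 samples, 20 features,
# only features 0 and 1 actually drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: zeroes some coefficients

n_zero = int((lasso.coef_ == 0).sum())  # count of dropped features
```

Ridge keeps every coefficient but shrinks the vector's norm toward zero; Lasso additionally sets irrelevant coefficients exactly to zero, acting as an implicit form of the explicit feature selection the footnote mentions as an alternative.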

  23. We used R’s caret package for that purpose. Further details on model training and parameter tuning can be found here: https://topepo.github.io/caret/model-training-and-tuning.html#basic-parameter-tuning.

  24. The attention check for the movie domain study was nearly identical to the one in the recipe study. Instead of displaying the attention check in the “directions” text, we displayed it in the “stars” section.

  25. We chose a list length of 5 items not only to keep the cognitive load for participants low but also because recipe sites often display no more than 5 recommendations without scrolling.

  26. The set \(R \backslash r_i\) does not contain recipes or movie pairs already used in Study 1a and Study 1b, respectively.

  27. The attention check was in the “description” section for the recipe recommender study and in the “star(s)” section for the movie study.

  28. Considering recommendations for reference recipes that the user does not like (e.g., because she is a vegetarian and the reference meal contains meat) will also lead to low response values for the recommendations, since these are assumed to be similar to the reference.

  29. We see this as another indicator of the reliability of the respondents.

References

  • Adomavicius, G., Kwon, Y.: Improving aggregate recommendation diversity using ranking-based techniques. IEEE Trans. Knowl. Data Eng. 24(5), 896–911 (2012)

  • Allison, L., Dix, T.I.: A bit-string longest-common-subsequence algorithm. Inf. Process. Lett. 23(5), 305–310 (1986)

  • Aucouturier, J.J., Pachet, F., et al.: Music similarity measures: what’s the use? In: Proceedings of ISMIR ’02 (2002)

  • Beel, J., Langer, S.: A comparison of offline evaluations, online evaluations, and user studies in the context of research-paper recommender systems. In: Proceedings of TPDL ’15 (2015)

  • Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  • Brovman, Y.M., Jacob, M., Srinivasan, N., Neola, S., Galron, D., Snyder, R., Wang, P.: Optimizing similar item recommendations in a semi-structured marketplace to maximize conversion. In: Proceedings of RecSys ’16 (2016)

  • Buhrmester, M., Kwang, T., Gosling, S.D.: Amazon’s mechanical Turk: a new source of inexpensive, yet high-quality, data? Perspect. Psychol. Sci. 6(1), 3–5 (2011)

  • Colucci, L., Doshi, P., Lee, K.L., Liang, J., Lin, Y., Vashishtha, I., Zhang, J., Jude, A.: Evaluating item–item similarity algorithms for movies. In: Proceedings of CHI EA ’16 (2016)

  • Cremonesi, P., Garzotto, F., Turrin, R.: Investigating the persuasion potential of recommender systems from a quality perspective: an empirical study. ACM Trans. Intell. Syst. Technol. (2012). https://doi.org/10.1145/2209310.2209314

  • Deldjoo, Y., Elahi, M., Cremonesi, P., Garzotto, F., Piazzolla, P., Quadrana, M.: Content-based video recommendation system based on stylistic visual features. J. Data Semant. 5(2), 1–15 (2016)

  • eBizMBA: eBizMBA Rankings for Recipe Websites (2017). http://www.ebizmba.com/articles/recipe-websites. Accessed 19 April 2017

  • Eksombatchai, C., Jindal, P., Liu, J.Z., Liu, Y., Sharma, R., Sugnet, C., Ulrich, M., Leskovec, J.: Pixie: a system for recommending 3+ billion items to 200+ million users in real-time. In: Proceedings of the Web Conference ’18 (2018)

  • Ellis, D.P.W., Whitman, B., Berenzweig, A., Lawrence, S.: The quest for ground truth in musical artist similarity. In: Proceedings of ISMIR ’02 (2002)

  • Elsweiler, D., Trattner, C., Harvey, M.: Exploiting food choice biases for healthier recipe recommendation. In: Proceedings of SIGIR ’17 (2017)

  • Freyne, J., Berkovsky, S.: Intelligent food planning: personalized recipe recommendation. In: Proceedings of IUI ’10 (2010)

  • Garcin, F., Faltings, B., Donatsch, O., Alazzawi, A., Bruttin, C., Huber, A.: Offline and online evaluation of news recommender systems at swissinfo.ch. In: Proceedings of RecSys ’14 (2014)

  • Gedikli, F., Jannach, D.: Improving recommendation accuracy based on item-specific tag preferences. ACM Trans. Intell. Syst. Technol. 4(1), 43–55 (2013)

  • Gedikli, F., Jannach, D., Ge, M.: How should I explain? A comparison of different explanation types for recommender systems. Int. J. Hum Comput Stud. 72(4), 367–382 (2014)

  • Golbeck, J., Hendler, J., et al.: Filmtrust: movie recommendations using trust in web-based social networks. In: Proceedings of CCNC ’06 (2006)

  • Harvey, M., Ludwig, B., Elsweiler, D.: You are what you eat: learning user tastes for rating prediction. In: Proceedings of SPIRE ’13 (2013)

  • Hasler, D., Suesstrunk, S.E.: Measuring colorfulness in natural images. In: Human vision and electronic imaging VIII, vol. 5007, pp. 87–96. International Society for Optics and Photonics (2003)

  • Hauser, D.J., Schwarz, N.: Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behav. Res. Methods 48(1), 400–407 (2016)

  • He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR ’16, pp. 770–778 (2016)

  • Howard, S., Adams, J., White, M., et al.: Nutritional content of supermarket ready meals and recipes by television chefs in the United Kingdom: cross sectional study. BMJ 345, e7607 (2012)

  • Einhorn, H.J., Kleinmuntz, D.N., Kleinmuntz, B.: Linear regression and process-tracing models of judgment. Psychol. Rev. 86, 465–485 (1979)

  • Jannach, D., Adomavicius, G.: Recommendations with a purpose. In: Proceedings of RecSys ’16 (2016)

  • Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)

  • Jones, M.C., Downie, J.S., Ehmann, A.F.: Human similarity judgments: implications for the design of formal evaluations. In: Proceedings of ISMIR ’07 (2007)

  • Kim, S.D., Lee, Y.J., Cho, H.G., Yoon, S.M.: Complexity and similarity of recipes based on entropy measurement. Indian J. Sci. Technol. (2016). https://doi.org/10.17485/ijst/2016/v9i26/97324

  • Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the user experience of recommender systems. User Model. User Adapt. Interact. 22(4), 441–504 (2012)

  • Kondrak, G.: N-gram similarity and distance. In: Proceedings of SPIRE ’05, pp. 115–126. Springer (2005)

  • Kusmierczyk, T., Nørvåg, K.: Online food recipe title semantics: combining nutrient facts and topics. In: Proceedings of CIKM ’16 (2016)

  • Lee, J.H.: Crowdsourcing music similarity judgments using mechanical Turk. In: Proceedings of ISMIR ’10 (2010)

  • Lops, P., De Gemmis, M., Semeraro, G.: Content-based recommender systems: state of the art and trends. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook. Springer, New York (2011)

  • Maksai, A., Garcin, F., Faltings, B.: Predicting online performance of news recommender systems through richer evaluation metrics. In: Proceedings of RecSys ’15 (2015)

  • Messina, P., Dominguez, V., Parra, D., Trattner, C., Soto, A.: Content-based artwork recommendation: integrating painting metadata with neural and manually-engineered visual features. User Model. User Adapt. Interact. 28, 40 (2018)

  • Milosavljevic, M., Navalpakkam, V., Koch, C., Rangel, A.: Relative visual saliency differences induce sizable bias in consumer choice. J. Consum. Psychol. 22(1), 67–74 (2012)

  • Mirizzi, R., Di Noia, T., Ragone, A., Ostuni, V.C., Di Sciascio, E.: Movie recommendation with DBpedia. In: Proceedings of IIR ’12 (2012)

  • Oleszak, M.: Regularization: Ridge, lasso and elastic net (2018). https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net. Accessed June 2019

  • O’Mahony, M.P., Smyth, B.: Learning to recommend helpful hotel reviews. In: Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pp. 305–308 (2009)

  • Ostuni, V.C., Di Noia, T., Di Sciascio, E., Mirizzi, R.: Top-n recommendations from implicit feedback leveraging linked open data. In: Proceedings of RecSys ’13 (2013)

  • Peer, E., Vosgerau, J., Acquisti, A.: Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav. Res. Methods 46(4), 1023–1031 (2014)

  • Pu, P., Chen, L., Hu, R.: A user-centric evaluation framework for recommender systems. In: Proceedings of RecSys ’11 (2011)

  • Rokicki, M., Trattner, C., Herder, E.: The impact of recipe features, social cues and demographics on estimating the healthiness of online recipes. In: Proceedings of ICWSM ’18 (2018)

  • Rossetti, M., Stella, F., Zanker, M.: Contrasting offline and online results when evaluating recommendation algorithms. In: Proceedings of RecSys ’16 (2016)

  • Rublee, E., Rabaud, V., Konolige, K., Bradski, G.R.: ORB: an efficient alternative to SIFT or SURF. In: Proceedings of ICCV ’11 (2011)

  • San Pedro, J., Siersdorfer, S.: Ranking and classifying attractiveness of photos in folksonomies. In: Proceedings of WWW ’09 (2009)

  • Sen, S., Vig, J., Riedl, J.: Tagommenders: connecting users to items through tags. In: Proceedings of WWW ’09 (2009)

  • Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)

  • Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv preprint arXiv:1409.1556

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of CVPR ’16, pp. 2818–2826 (2016)

  • Teng, C.Y., Lin, Y.R., Adamic, L.A.: Recipe recommendation using ingredient networks. In: Proceedings of WebSci ’12 (2012)

  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58(1), 267–288 (1996)

  • Tran, T.N.T., Atas, M., Felfernig, A., Stettinger, M.: An overview of recommender systems in the healthy food domain. J. Intell. Inf. Syst. 50, 501–526 (2017)

  • Trattner, C., Elsweiler, D.: Food recommender systems: important contributions, challenges and future research directions (2017a). arXiv preprint arXiv:1711.02760

  • Trattner, C., Elsweiler, D.: Investigating the healthiness of internet-sourced recipes: implications for meal planning and recommender systems. In: Proceedings of WWW ’17, pp. 489–498 (2017b)

  • Trattner, C., Moesslang, D., Elsweiler, D.: On the predictability of the popularity of online recipes. EPJ Data Sci. (2018). https://doi.org/10.1140/epjds/s13688-018-0149-5

  • Trattner, C., Kusmierczyk, T., Nørvåg, K.: Investigating and predicting online food recipe upload behavior. Inf. Process. Manag. 56(3), 654–673 (2019)

  • Tversky, A., Gati, I.: Studies of similarity. Cognit. Categ. 1(1978), 79–98 (1978)

  • van Pinxteren, Y., Geleijnse, G., Kamsteeg, P.: Deriving a recipe similarity measure for recommending healthful meals. In: Proceedings of IUI ’11 (2011)

  • Vargas, S., Castells, P.: Rank and relevance in novelty and diversity metrics for recommender systems. In: Proceedings of RecSys ’11 (2011)

  • Vig, J., Sen, S., Riedl, J.: Tagsplanations: explaining recommendations using tags. In: Proceedings of IUI ’09, pp. 47–56 (2009)

  • Wang, L., Li, Q., Li, N., Dong, G., Yang, Y.: Substructure similarity measurement in Chinese recipes. In: Proceedings of WWW ’08 (2008)

  • Wang, C., Agrawal, A., Li, X., Makkad, T., Veljee, E., Mengshoel, O., Jude, A.: Content-based top-n recommendations with perceived similarity. In: Proceedings of SMC ’17 (2017)

  • Yang, L., Hsieh, C.K., Yang, H., Pollak, J.P., Dell, N., Belongie, S., Cole, C., Estrin, D.: Yum-me: a personalized nutrient-based meal recommender system. ACM Trans. Inf. Syst. 36(1), 7 (2017)

  • Yao, Y., Harper, F.M.: Judging similarity: a user-centric study of related item recommendations. In: Proceedings of RecSys ’18 (2018)

  • Yujian, L., Bo, L.: A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1091–1095 (2007)

  • Zhong, Y., Menezes, T.L.S., Kumar, V., Zhao, Q., Harper, F.M.: A field study of related video recommendations: newest, most similar, or most relevant? In: Proceedings of RecSys ’18 (2018)

  • Ziegler, C.N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: Proceedings of WWW ’05 (2005)

Author information

Corresponding author

Correspondence to Christoph Trattner.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 10, 11, 12, 13, 14, 15, 16 and Figs. 11, 12, 13, 14, 15, and 16.

Table 10 Similarity metrics computed based on movie titles, images, plots, genres, director(s), release dates and stars
Table 11 Similarity metric correlation (Spearman) with user similarity estimates per cues when metrics are linearly combined (movie domain) using equal weights in the linear model
Table 12 Results when considering additional features (movie domain)
Table 13 Results when considering only one information cue at the time (movie domain)
Table 14 Survey questions for the recipe domain
Table 15 Survey questions for the movie domain
Table 16 Recipe and movie dataset content feature statistics
Fig. 11

Crowdworker characteristics (who passed the attention check) of the similarity assessment study (movie domain)

Fig. 12

Feature importance for the best performing Ridge regression model (movie domain)

Fig. 13

Study 1b: Web interface to collect similarity judgments for movies. Regarding the choice of features to be shown, note that it is not uncommon in practice to show more than just the title, image and a short description. iTunes, for example, shows the genre; IMDb also shows the plot, directors and star ratings

Fig. 14

Screen capture of Study 2b (movie domain)

Fig. 15

Study 2b: a helpfulness, b diversity, c surprisingness and d excitingness of the recommended lists (means and std. errors). Scale: 1 (not at all)–5 (totally agree)

Fig. 16

Study 2b intention to use the recommendation method in the future (means and std. errors). Scale: 1 (not at all)–5 (totally agree)

About this article

Cite this article

Trattner, C., Jannach, D. Learning to recommend similar items from human judgments. User Model User-Adap Inter 30, 1–49 (2020). https://doi.org/10.1007/s11257-019-09245-4
