2.1 Explanations in Recommender Systems
In recent years, beyond focusing on algorithmic accuracy [49], there has been increasing interest in providing explanations in recommender systems to make recommendations more convincing, which relates to user acceptance and satisfaction [21, 24, 58]. Explanations aim to help users understand the recommended items and make decisions. A number of recommendation explanations have been proposed, with varying styles (e.g., user-based, content-based) and presentation formats (e.g., textual [16, 54] or visual [24]).
One of the most common types of explanation is based on the similarity between items or users [3, 14, 24, 58]. Herlocker et al. [24] propose using collaborative filtering results to explain recommendations, which significantly improves user acceptance. A well-known example is Amazon’s “Users who bought ... also bought ...” explanation, and similar explanations are used by Netflix, Spotify, and other online applications. Although similarity-based explanations can increase the transparency of recommendations [3, 46], users’ perceived trustworthiness has been found to decrease at the same time [4].
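To make the co-purchase logic behind such similarity-based explanations concrete, the minimal sketch below counts items that are frequently bought together and fills a “Users who bought ... also bought ...” template. The purchase data and the `also_bought` helper are hypothetical illustrations, not any production system’s pipeline.

```python
from collections import Counter

# Hypothetical purchase histories: user id -> set of item ids.
purchases = {
    "u1": {"movie_a", "movie_b"},
    "u2": {"movie_a", "movie_b", "movie_c"},
    "u3": {"movie_a", "movie_c"},
}

def also_bought(target_item, purchases, top_k=2):
    """Rank the items most frequently co-purchased with `target_item`."""
    co_counts = Counter()
    for items in purchases.values():
        if target_item in items:
            co_counts.update(items - {target_item})
    return [item for item, _ in co_counts.most_common(top_k)]

# Explanation template in the style used by many e-commerce sites.
item = "movie_a"
print(f"Users who bought {item} also bought: {', '.join(also_bought(item, purchases))}")
```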
Besides similarity, some recommendation explanations incorporate content information about items (e.g., attributes and reviews) and show the potential to increase user satisfaction [4, 16, 42, 53, 63, 66]. For attributes, Vig et al. [60] design a method that computes tags for each item and uses these tags as explanations. Xian et al. [63] use a three-step framework to provide recommendations together with the corresponding important attributes as explanations. For reviews, McAuley et al. [40] propose using user reviews for explainable recommendation; their method, Hidden Factors as Topics (HFT), models the hidden connections among users, items, and reviews. Zhang et al. [66] use representative phrases in user reviews to produce explainable recommendation results. More recently, Chen et al. [8] design an attention-based neural recommendation method that surfaces the most valuable review for users. Chang et al. [7] design a process combining crowdsourcing and computation to generate personalized natural language explanations. Zhu et al. [69] design a multi-task learning model that jointly learns rating prediction and explanation generation. Hada et al. [23] propose an end-to-end framework that generates explanations in a plug-and-play manner and is efficient to train. All these attempts show that side information in recommender systems is valuable for generating good explanations for users.
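As a rough illustration of how attention can surface a single review as an explanation, the sketch below scores hypothetical review embeddings against a user-item query vector and returns the highest-weighted review. The embeddings and dimensions are toy assumptions, not the model of any cited work.

```python
import numpy as np

# Hypothetical embeddings: one query vector for the (user, item) pair
# and one vector per candidate review.
rng = np.random.default_rng(0)
query = rng.normal(size=16)
review_vectors = rng.normal(size=(5, 16))
reviews = [f"review_{i}" for i in range(5)]

# Softmax attention over reviews; the highest-weighted review is shown as the explanation.
scores = review_vectors @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()
best = int(weights.argmax())
print("explanation review:", reviews[best], "weight:", round(float(weights[best]), 3))
```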
Recently, knowledge graphs such as Freebase [5] have been incorporated to generate knowledge-related content explanations [9, 22, 37, 38, 47, 55, 61, 62, 68, 70]. Many knowledge-graph-enhanced recommendation methods not only provide better recommendation results but also generate explicit explanations based on the knowledge graph. For example, Ma et al. [39] propose a multi-task learning framework that applies rules from a knowledge graph to recommender systems, and their experiments show that the rules also serve as explanations of the recommendation results. Zhao et al. [68] take temporal information into account and design a time-aware path reasoning method for explainable recommendation. Geng et al. [22] propose a path language modeling recommendation framework to tackle the recall bias in knowledge path reasoning. These works show that the comprehensive information in knowledge graphs is quite useful for explainable recommendation.
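For intuition about how a reasoning path can be turned into a textual explanation, the sketch below verbalizes a toy user-to-item path; the triples and the template are assumptions for illustration, not the procedure of any cited method.

```python
# A toy knowledge-graph path: (head, relation, tail) triples from a user to the recommended item.
path = [
    ("user_42", "watched", "Inception"),
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Interstellar"),
]

def path_to_explanation(path):
    """Verbalize a reasoning path as a simple natural-language explanation."""
    steps = [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in path]
    recommended = path[-1][-1]
    return f"Recommended {recommended} because " + ", and ".join(steps) + "."

print(path_to_explanation(path))
```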
Other kinds of information and presentation forms have also been applied in explanations. For instance, previous studies find that explanations indicating the average rating [3] and social-based explanations [29] can enhance users’ trust. Friedrich and Zanker [19] categorize different explainable recommender systems and argue that future explanations can be developed by including new kinds of information.
In summary, recommender systems have used many types of information to generate explanations, including similarities between items or users, content, ratings, and social relationships. However, which information users actually need remains under-explored. In this work, we directly collect users’ self-explanations of their watch intentions, through which we can study the actual motivations behind users’ decisions and the information they desire in explanations. Meanwhile, we also collect human-generated explanations and compare them with current system-generated ones in terms of both their effects on user intentions and users’ perceptions. Furthermore, we categorize the patterns of human explaining and use them to generate Human-Inspired Explanations (HIE).
2.2 Effects of Recommendation Explanations
Explanations are expected to increase users’ perceptions of transparency, scrutability, effectiveness, efficiency, persuasiveness, and satisfaction [2, 15, 57]. Previous work has studied and identified factors related to these goals, including explanation types, explanation attributes, and user characteristics.
Millecamp et al. [41] compare recommendations with and without explanations and find that explanations can enhance users’ understanding and increase the effectiveness of the recommendations. Herlocker et al. [24] evaluate the effectiveness of 21 different explanation styles (e.g., rating-based, neighbor-based) through users’ responses. As for interfaces, a user study [60] investigates the effects of four classical explanation interfaces on users’ perceived justification, effectiveness, and mood compatibility. Kouki et al. [29] evaluate both single-style explanations (user-based, item-based, content-based, social-based, and popularity-based) and hybrid explanations in terms of users’ perceived transparency, persuasiveness, and satisfaction. Balog and Radlinski [2] develop a survey-based experimental protocol for evaluating seven different goals of recommendation explanations (e.g., transparency, persuasiveness) and find close correlations among these goals.
Explanations of different types and styles have been found to have different effects on users’ perceptions, but why such differences exist is less studied. Besides, most previous studies focus on impacts on user perceptions (e.g., satisfaction), which can only be collected directly from users, limiting the evaluation and optimization of explanations [1, 65]. To address these problems, objective factors and measurements related to explanation effectiveness could be the key.
Some previous studies also analyze the impacts of recommendation results and explanations, especially trust-related effects. Berkovsky et al. [3] conduct a crowd-sourced study to examine the impact of various recommendation interfaces and content selection strategies on user trust. Kunkel et al. [31] describe an empirical study that investigates the trust-related influence of two scenarios: human-generated recommendations and automated recommending.
In this work, we identify major factors related to user satisfaction with explanations, including information accuracy, measured by the consistency of information points between explanations and user self-explanations. This reveals one of the reasons behind the differences in explanation impacts and provides insights into directions for further optimizing explanation generation, as well as into offline evaluation protocols.
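One simple way to operationalize such consistency is set overlap between the information points mentioned in a system explanation and those in the user’s self-explanation. The sketch below computes precision, recall, and F1 over hypothetical annotated point labels; the labels and the choice of F1 are illustrative assumptions, not the exact measure used in this work.

```python
def point_consistency(explanation_points, self_explanation_points):
    """Precision/recall/F1 of information points in an explanation
    against the points the user mentioned in their self-explanation."""
    explanation_points = set(explanation_points)
    self_explanation_points = set(self_explanation_points)
    overlap = explanation_points & self_explanation_points
    precision = len(overlap) / len(explanation_points) if explanation_points else 0.0
    recall = len(overlap) / len(self_explanation_points) if self_explanation_points else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical annotations: which kinds of information each text mentions.
print(point_consistency({"director", "genre", "rating"}, {"director", "plot"}))
```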
2.3 Evaluation of Recommendation Explanations
How to evaluate recommendation explanations is a challenging problem. Existing work has proposed several methods, which can be coarsely divided into four types: case studies, quantitative metrics, crowdsourcing, and online experiments [11].
A simple method is to check the rationality of explanations on a few cases based on human intuition [35, 36, 62]. It is common to visualize the weights of different explanation information, such as attributes in attribute-based recommendation [35], neighbors in graph-based recommendation [36], and reasoning paths in knowledge-based recommendation [62]. Although case studies are intuitive, they are biased and cannot be used to compare different models precisely.
Quantitative metrics can provide a more convincing evaluation. A common approach is to treat explanation generation as a natural language generation task [23, 33, 34]: the generated explanations should be consistent with user reviews, which can be measured by metrics such as BLEU and ROUGE. Besides, some work [56] proposes metrics from a counterfactual perspective; the main idea is that if an explanation is correct, the recommendation should change when the features/items used in the explanation change. Quantitative metrics are more objective and efficient than case studies, but the designed metrics may not be consistent with the goals of explanation, which we discuss below.
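As a sketch of this text-matching style of evaluation, the snippet below scores one generated explanation against one user review with sentence-level BLEU (via NLTK) and a simple LCS-based ROUGE-L F1. The example texts are hypothetical, and real evaluations typically aggregate over a corpus with standard tooling.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i, rt in enumerate(r):
        for j, ht in enumerate(h):
            dp[i + 1][j + 1] = dp[i][j] + 1 if rt == ht else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(h), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

reference = "great acting and a touching story about family"        # user review (hypothetical)
hypothesis = "recommended for its touching story and great acting"  # generated explanation (hypothetical)

bleu = sentence_bleu([reference.split()], hypothesis.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}, ROUGE-L: {rouge_l_f1(reference, hypothesis):.3f}")
```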
Apart from quantitative metrics, some methods involve human judgments in the evaluation of explanations, which we refer to as crowdsourcing [12, 20, 25, 59, 63]. There are mainly three variants, depending on how the dataset is constructed. The first is crowdsourcing with public datasets [12, 63]: the model generates explanations based on public data, and recruited annotators then evaluate these explanations. The drawback is that the preferences of the annotators may differ from those of the users in the datasets. The second is crowdsourcing with annotator data plus public datasets [20]: the annotators generate extra data (e.g., writing reviews) based on the public data, the model is trained on the combination of the extra data and the public data, and the annotators evaluate only the explanations for their own data, since they know the real user preferences. The third is crowdsourcing with fully constructed datasets [25, 59]: the model is trained only on the annotators’ data. Crowdsourcing is more accurate but also more expensive than the two methods above.
Online experiments are widely regarded as the gold standard for recommender systems. Online users are randomly divided into two groups, i.e., a model group and a baseline group, and utility metrics (e.g., click-through rate) are computed over a certain period. It is believed that better explanations should lead to better results [63, 66]. Although online experiments are more reliable, they are expensive and may cause negative user experiences.
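For concreteness, a minimal sketch of such an online comparison is shown below: the click-through rates of the two groups are compared with a two-proportion z-test. The counts are hypothetical, and real platforms rely on more elaborate experimentation pipelines.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical click/impression counts for the model group and the baseline group.
clicks = {"model": 1250, "baseline": 1100}
impressions = {"model": 20000, "baseline": 20000}

ctr = {g: clicks[g] / impressions[g] for g in clicks}

# Two-proportion z-test on click-through rate.
pooled = (clicks["model"] + clicks["baseline"]) / (impressions["model"] + impressions["baseline"])
se = sqrt(pooled * (1 - pooled) * (1 / impressions["model"] + 1 / impressions["baseline"]))
z = (ctr["model"] - ctr["baseline"]) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print({g: round(r, 4) for g, r in ctr.items()}, "z =", round(z, 2), "p =", round(p_value, 4))
```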
This work is most related to quantitative metrics and crowdsourcing. Previous work usually leverages user reviews as the ground truth for explanations. However, reviews reflect feelings after consumption, which may deviate from the feelings users have when receiving recommendation explanations. Thus, we collect users’ real feelings when they receive recommendations and make decisions, which we call self-explanations, to observe and analyze users’ true requirements for explanations. The corresponding results can help design better metrics.