2.1 Explanations in Recommender Systems
In recent years, beyond focusing on algorithmic accuracy [49], there has been increasing interest in providing explanations in recommender systems to make recommendations more convincing, which relates to user acceptance and satisfaction [21, 24, 58]. Explanations aim to help users understand the recommended items and make decisions. A number of recommendation explanations have been proposed, with varying styles (e.g., user-based, content-based) and presentation formats (e.g., textual [16, 54] or visual [24]).
One of the most common types of explanation is based on the similarity between items or users [3, 14, 24, 58]. Herlocker et al. [24] propose using collaborative filtering results to explain recommendations, which significantly improves user acceptance. A well-known example is Amazon’s “Users who bought ... also bought ...” explanation, and similar explanations are used by Netflix, Spotify, and other online applications. Although similarity-based explanations can increase the transparency of recommendations [3, 46], users’ perceived trustworthiness has been found to decrease at the same time [4].
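To make the co-purchase logic behind such similarity-based explanations concrete, the minimal sketch below counts items that are frequently bought together and fills a “Users who bought ... also bought ...” template. The purchase data and the `also_bought` helper are hypothetical illustrations, not any production system’s pipeline.

```python
from collections import Counter

# Hypothetical purchase histories: user id -> set of item ids.
purchases = {
    "u1": {"movie_a", "movie_b"},
    "u2": {"movie_a", "movie_b", "movie_c"},
    "u3": {"movie_a", "movie_c"},
}

def also_bought(target_item, purchases, top_k=2):
    """Rank the items most frequently co-purchased with `target_item`."""
    co_counts = Counter()
    for items in purchases.values():
        if target_item in items:
            co_counts.update(items - {target_item})
    return [item for item, _ in co_counts.most_common(top_k)]

# Explanation template in the style used by many e-commerce sites.
item = "movie_a"
print(f"Users who bought {item} also bought: {', '.join(also_bought(item, purchases))}")
```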
Besides similarity, some recommendation explanations incorporate content information about items (e.g., attributes and reviews) and show the potential to increase user satisfaction [4, 16, 42, 53, 63, 66]. For attributes, Vig et al. [60] design a method that computes tags for each item and uses these tags as explanations. Xian et al. [63] use a three-step framework to provide recommendations together with the corresponding important attributes as explanations. For reviews, McAuley et al. [40] propose using user reviews for explainable recommendation; their method, Hidden Factors as Topics (HFT), models the hidden connections among users, items, and reviews. Zhang et al. [66] use representative phrases in user reviews to produce explainable recommendation results. More recently, Chen et al. [8] design an attention-based neural recommendation method that surfaces the most valuable review for users. Chang et al. [7] design a process combining crowdsourcing and computation to generate personalized natural language explanations. Zhu et al. [69] design a multi-task learning model that jointly learns rating prediction and explanation generation. Hada et al. [23] propose an end-to-end framework that generates explanations in a plug-and-play manner and is efficient to train. All these attempts show that side information in recommender systems is valuable for generating good explanations for users.
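As a rough illustration of how attention can surface a single review as an explanation, the sketch below scores hypothetical review embeddings against a user-item query vector and returns the highest-weighted review. The embeddings and dimensions are toy assumptions, not the model of any cited work.

```python
import numpy as np

# Hypothetical embeddings: one query vector for the (user, item) pair
# and one vector per candidate review.
rng = np.random.default_rng(0)
query = rng.normal(size=16)
review_vectors = rng.normal(size=(5, 16))
reviews = [f"review_{i}" for i in range(5)]

# Softmax attention over reviews; the highest-weighted review is shown as the explanation.
scores = review_vectors @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()
best = int(weights.argmax())
print("explanation review:", reviews[best], "weight:", round(float(weights[best]), 3))
```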
Recently, knowledge graphs such as Freebase [5] have been incorporated to generate knowledge-related content explanations [9, 22, 37, 38, 47, 55, 61, 62, 68, 70]. Many knowledge-graph-enhanced recommendation methods not only provide better recommendation results but also generate explicit explanations based on the knowledge graph. For example, Ma et al. [39] propose a multi-task learning framework that applies rules from a knowledge graph to recommender systems, and their experiments show that the rules also serve as explanations of the recommendation results. Zhao et al. [68] take temporal information into account and design a time-aware path reasoning method for explainable recommendation. Geng et al. [22] propose a path language modeling recommendation framework to tackle the recall bias in knowledge path reasoning. These works show that the comprehensive information in knowledge graphs is quite useful for explainable recommendation.
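For intuition about how a reasoning path can be turned into a textual explanation, the sketch below verbalizes a toy user-to-item path; the triples and the template are assumptions for illustration, not the procedure of any cited method.

```python
# A toy knowledge-graph path: (head, relation, tail) triples from a user to the recommended item.
path = [
    ("user_42", "watched", "Inception"),
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Interstellar"),
]

def path_to_explanation(path):
    """Verbalize a reasoning path as a simple natural-language explanation."""
    steps = [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in path]
    recommended = path[-1][-1]
    return f"Recommended {recommended} because " + ", and ".join(steps) + "."

print(path_to_explanation(path))
```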
Other kinds of information and presentation forms have also been applied in explanations. For instance, previous studies find that explanations indicating the average rating [3] and social-based explanations [29] can enhance users’ trust. Friedrich and Zanker [19] categorize different explainable recommender systems and argue that future explanations can be developed by including new kinds of information.
In summary, recommender systems have used many types of information to generate explanations, including similarities between items or users, content, ratings, and social relationships. However, which information users actually need remains under-explored. In this work, we directly collect users’ self-explanations of their watch intentions, through which we can study the actual motivations behind users’ decisions and the information they desire in explanations. Meanwhile, we also collect human-generated explanations and compare them with current system-generated ones in terms of both their effects on user intentions and users’ perceptions. Furthermore, we categorize the patterns of human explaining and use them to generate Human-Inspired Explanations (HIE).
2.2 Effects of Recommendation Explanations
Explanations are expected to increase users’ perceptions of transparency, scrutability, effectiveness, efficiency, persuasiveness, and satisfaction [2, 15, 57]. Previous work has studied and identified factors related to these goals, including explanation types, explanation attributes, and user characteristics.
Millecamp et al. [41] compare recommendations with and without explanations and find that explanations can enhance users’ understanding and increase the effectiveness of the recommendations. Herlocker et al. [24] evaluate the effectiveness of 21 different explanation styles (e.g., rating-based, neighbor-based) through users’ responses. As for interfaces, a user study [60] investigates the effects of four classical explanation interfaces on users’ perceived justification, effectiveness, and mood compatibility. Kouki et al. [29] evaluate both single-style explanations (user-based, item-based, content-based, social-based, and popularity-based) and hybrid explanations in terms of users’ perceived transparency, persuasiveness, and satisfaction. Balog and Radlinski [2] develop a survey-based experimental protocol for evaluating seven different goals of recommendation explanations (e.g., transparency, persuasiveness) and find close correlations among these goals.
Explanations of different types and styles have been found to have different effects on users’ perceptions, but why such differences exist is less studied. Besides, most previous studies focus on impacts on user perceptions (e.g., satisfaction), which can only be collected directly from users, limiting the evaluation and optimization of explanations [1, 65]. To address these problems, objective factors and measurements related to explanation effectiveness could be the key.
Some previous studies also analyze the impacts of recommendation results and explanations, especially trust-related effects. Berkovsky et al. [3] conduct a crowd-sourced study to examine the impact of various recommendation interfaces and content selection strategies on user trust. Kunkel et al. [31] describe an empirical study that investigates the trust-related influence of two scenarios: human-generated recommendations and automated recommending.
In this work, we identify major factors related to user satisfaction with explanations, including information accuracy, measured by the consistency of information points between explanations and user self-explanations. This reveals one of the reasons behind the differences in explanation impacts and provides insights into directions for further optimizing explanation generation, as well as into offline evaluation protocols.
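One simple way to operationalize such consistency is set overlap between the information points mentioned in a system explanation and those in the user’s self-explanation. The sketch below computes precision, recall, and F1 over hypothetical annotated point labels; the labels and the choice of F1 are illustrative assumptions, not the exact measure used in this work.

```python
def point_consistency(explanation_points, self_explanation_points):
    """Precision/recall/F1 of information points in an explanation
    against the points the user mentioned in their self-explanation."""
    explanation_points = set(explanation_points)
    self_explanation_points = set(self_explanation_points)
    overlap = explanation_points & self_explanation_points
    precision = len(overlap) / len(explanation_points) if explanation_points else 0.0
    recall = len(overlap) / len(self_explanation_points) if self_explanation_points else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical annotations: which kinds of information each text mentions.
print(point_consistency({"director", "genre", "rating"}, {"director", "plot"}))
```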
2.3 Evaluation of Recommendation Explanations
How to evaluate recommendation explanations is a challenging problem. Existing work has proposed several methods, which can be coarsely divided into four types: case studies, quantitative metrics, crowdsourcing, and online experiments [11].
A simple method is to check the rationality of explanations on a few cases based on human intuition [35, 36, 62]. It is common to visualize the weights of different explanation information, such as attributes in attribute-based recommendation [35], neighbors in graph-based recommendation [36], and reasoning paths in knowledge-based recommendation [62]. Although case studies are intuitive, they are biased and cannot be used to compare different models precisely.
Quantitative metrics can provide a more convincing evaluation. A common approach is to treat explanation generation as a natural language generation task [23, 33, 34]: the generated explanations should be consistent with user reviews, which can be measured by metrics such as BLEU and ROUGE. Besides, some work [56] proposes metrics from a counterfactual perspective; the main idea is that if an explanation is correct, the recommendation should change when the features/items used in the explanation change. Quantitative metrics are more objective and efficient than case studies, but the designed metrics may not be consistent with the goals of explanation, which we discuss below.
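As a sketch of this text-matching style of evaluation, the snippet below scores one generated explanation against one user review with sentence-level BLEU (via NLTK) and a simple LCS-based ROUGE-L F1. The example texts are hypothetical, and real evaluations typically aggregate over a corpus with standard tooling.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i, rt in enumerate(r):
        for j, ht in enumerate(h):
            dp[i + 1][j + 1] = dp[i][j] + 1 if rt == ht else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(h), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

reference = "great acting and a touching story about family"        # user review (hypothetical)
hypothesis = "recommended for its touching story and great acting"  # generated explanation (hypothetical)

bleu = sentence_bleu([reference.split()], hypothesis.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}, ROUGE-L: {rouge_l_f1(reference, hypothesis):.3f}")
```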
Apart from quantitative metrics, some methods involve human judgments in the evaluation of explanations, which we refer to as crowdsourcing [12, 20, 25, 59, 63]. There are mainly three variants, depending on how the dataset is constructed. The first is crowdsourcing with public datasets [12, 63]: the model generates explanations based on public data, and recruited annotators then evaluate these explanations. The drawback is that the preferences of the annotators may differ from those of the users in the datasets. The second is crowdsourcing with annotator data plus public datasets [20]: the annotators generate extra data (e.g., writing reviews) based on the public data, the model is trained on the combination of the extra data and the public data, and the annotators evaluate only the explanations for their own data, since they know the real user preferences. The third is crowdsourcing with fully constructed datasets [25, 59]: the model is trained only on the annotators’ data. Crowdsourcing is more accurate but also more expensive than the two methods above.
Online experiments are widely regarded as the gold standard for recommender systems. Online users are randomly divided into two groups, i.e., a model group and a baseline group, and utility metrics (e.g., click-through rate) are computed over a certain period. It is believed that better explanations should lead to better results [63, 66]. Although online experiments are more reliable, they are expensive and may cause negative user experiences.
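For concreteness, a minimal sketch of such an online comparison is shown below: the click-through rates of the two groups are compared with a two-proportion z-test. The counts are hypothetical, and real platforms rely on more elaborate experimentation pipelines.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical click/impression counts for the model group and the baseline group.
clicks = {"model": 1250, "baseline": 1100}
impressions = {"model": 20000, "baseline": 20000}

ctr = {g: clicks[g] / impressions[g] for g in clicks}

# Two-proportion z-test on click-through rate.
pooled = (clicks["model"] + clicks["baseline"]) / (impressions["model"] + impressions["baseline"])
se = sqrt(pooled * (1 - pooled) * (1 / impressions["model"] + 1 / impressions["baseline"]))
z = (ctr["model"] - ctr["baseline"]) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print({g: round(r, 4) for g, r in ctr.items()}, "z =", round(z, 2), "p =", round(p_value, 4))
```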
This work is most related to quantitative metrics and crowdsourcing. Previous work usually leverages user reviews as the ground truth for explanations. However, reviews reflect feelings after consumption, which may deviate from the feelings users have when receiving recommendation explanations. Thus, we collect users’ real feelings when they receive recommendations and make decisions, which we call self-explanations, to observe and analyze users’ true requirements for explanations. The corresponding results can help design better metrics.