1 Introduction

The ISO 9241 standard, updated in 2018, defines usability as the “extent to which a system, product or service can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use” [1]. As the usability concept is too general, the standard also indicates that “the specified users, goals and context of use refer to the particular combination of users, goals and context of use for which usability is being considered”. As the standard highlights, the term “usability” is “also used as a qualifier to refer to the design knowledge, competencies, activities and design attributes that contribute to usability, such as usability expertise, usability professional, usability engineering, usability method, usability evaluation, usability heuristic”. However, we think that a clear distinction should be made between “usability” as a (software quality) attribute, usability evaluation and design methods, usability-related processes (usability engineering), and usability professionals.

It is largely agreed that User eXperience (UX) extends the usability concept, beyond its traditional dimensions (effectiveness, efficiency and satisfaction). The same ISO 9241 standard defines UX as “user’s perceptions and responses that result from the use and/or anticipated use of a system, product or service” [1]. It also specifies that “users’ perceptions and responses include the users’ emotions, beliefs, preferences, perceptions, comfort, behaviors, and accomplishments that occur before, during and after use”.

Proposed in the early ’90s, heuristic evaluation is one of the most popular usability evaluation methods [2]. A heuristic evaluation is performed by a small group of experts (usually 3 to 5), based on a set of principles/rules/guidelines called heuristics. Nielsen’s ten usability heuristics [3] are well known, but are often considered too general and unable to detect domain-related usability problems. That is why many other sets of heuristics have been proposed [4, 5]. Heuristic evaluation may be used to assess several UX aspects, not only usability [6].

Teaching the heuristic evaluation method and forming evaluators is challenging. We think practice is the best way to understand the heuristic evaluation protocol and the nature of usability heuristics [7, 8]. We performed a comparative study on novice evaluators’ perception of Nielsen’s heuristics, involving Computer Science students from a Chilean and a Spanish university [9, 10]. This paper presents a follow-up study, including experimental results from two new case studies.

The paper is structured as follows. Section 2 introduces the “Evaluator eXperience” concept and describes the questionnaire that we developed and have used for several years to assess (novice) evaluators’ perception. Section 3 presents the experiments that we conducted from 2016 to 2018 on the websites of three major online travel agencies: Atrapalo.com [11], TripAdvisor.com [12] and Expedia.com [13]. Section 4 discusses the experimental results. Section 5 highlights conclusions and future work.

2 Evaluator eXperience

Heuristic evaluators are a particular kind of “users” of particular “products” (artifacts): (1) the set of usability/UX heuristics and (2) the heuristic evaluation method. Both artifacts may be evaluated in terms of their “usability”. We may think of Evaluator eXperience as a particular case of UX, which may also be assessed.

We have conducted studies on evaluators’ perception of generic and specific usability heuristics for several years [14,15,16,17]. In each study, all participants are asked to perform a heuristic evaluation of the same case study. Then they are asked to participate in a post-experiment survey.

The quality of heuristics is an important topic, as it strongly influences the results of a heuristic evaluation. At least one heuristics quality scale has been proposed [18]. We developed our own scale, a questionnaire that assesses evaluators’ perception of a set of usability heuristics, based on 4 dimensions and 3 questions:

  • D1 – Utility: How useful the heuristic is.

  • D2 – Clarity: How clear the heuristic is.

  • D3 – Ease of use: How easy it was to associate identified problems with the heuristic.

  • D4 – Necessity of additional checklist: How necessary it would be to complement the heuristic with a checklist.

  • Q1 – Easiness: How easy was it to perform the heuristic evaluation, based on the given set of heuristics?

  • Q2 – Intention: Would you use the same set of heuristics when evaluating a similar software product in the future?

  • Q3 – Completeness: Do you think the set of heuristics covers all usability aspects for this kind of software product?

Each heuristic is rated individually on the 4 dimensions (D1 – Utility, D2 – Clarity, D3 – Ease of use, D4 – Necessity of additional checklist), while the set of heuristics is also rated globally, through the 3 questions (Q1 – Easiness, Q2 – Intention, Q3 – Completeness). In all cases we use a 5-point Likert scale (from 1 – worst, to 5 – best).
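As a minimal sketch of how such ratings can be tabulated, the snippet below stores hypothetical Likert scores (heuristic names follow Nielsen’s set, but all scores and the number of evaluators are illustrative only, not the study’s data) and computes the per-heuristic and per-dimension averages:

```python
from statistics import mean

# Hypothetical ratings: heuristic -> dimension -> one Likert score (1-5)
# per evaluator. Illustrative values only, not the study's actual data.
ratings = {
    "N1 Visibility of system status": {
        "D1": [4, 5, 4], "D2": [4, 4, 3], "D3": [3, 4, 3], "D4": [4, 5, 4],
    },
    "N2 Match between system and the real world": {
        "D1": [5, 4, 4], "D2": [3, 3, 4], "D3": [3, 3, 2], "D4": [5, 4, 4],
    },
}

# Each heuristic is rated individually on the four dimensions...
for name, dims in ratings.items():
    per_dim = {d: round(mean(scores), 2) for d, scores in dims.items()}
    print(name, per_dim)

# ...and the results can be summarized per dimension, pooling the scores
# over all heuristics and evaluators.
for dim in ("D1", "D2", "D3", "D4"):
    pooled = [s for dims in ratings.values() for s in dims[dim]]
    print(f"{dim} average: {mean(pooled):.2f}")
```

The global questions (Q1–Q3) would be stored separately, as a single score per evaluator rather than per heuristic.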

Additionally, two open questions are asked, to collect qualitative aspects of the evaluators’ experience:

  • OQ1: What did you perceive as most difficult to perform during the heuristic evaluation?

  • OQ2: What domain-related aspects do you think the set of heuristics does not cover?

3 Experiments

We conducted several experiments from 2016 to 2018 on the perception of Nielsen’s heuristics when evaluating online travel agencies. The experiments involved novice evaluators, Computer Science students from Chile and Spain:

  • Graduate and undergraduate students in Informatics Engineering at Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile, and

  • Undergraduate students of the Bachelor in Computer Engineering in Information Technologies at Universidad Miguel Hernandez de Elche, Elche, Spain.

All students were enrolled in introductory Human-Computer Interaction courses oriented toward Usability/UX. In all cases they were asked to perform a heuristic evaluation based on Nielsen’s heuristics, following Nielsen’s protocol. With few exceptions, it was the first time they performed a heuristic evaluation; it was also their first contact with Nielsen’s heuristics and evaluation protocol. After performing the heuristic evaluation, the students were asked to answer the questionnaire described in Sect. 2. All students participated voluntarily in the survey; there was no sample selection.

Experiments involved 112 Chilean and 31 Spanish students, as follows:

  • Atrapalo.com was evaluated by 31 Spanish undergraduate students, 17 Chilean undergraduate students, and 33 Chilean graduate students;

  • TripAdvisor.com was evaluated by 27 Chilean undergraduate students and 22 Chilean graduate students;

  • Expedia.com was evaluated by 13 Chilean undergraduate students.

Results obtained when evaluating Atrapalo.com were presented in detail in previous work [9, 10]. Section 4 synthesizes these results, describes the results obtained when evaluating TripAdvisor.com and Expedia.com, and compares them with the Atrapalo.com results.

The observations’ scale is ordinal, and no assumption of normality could be made. Therefore, the survey results were analyzed using nonparametric statistical tests (Kruskal-Wallis, Mann-Whitney U and Spearman ρ). In all tests, p-value ≤ 0.05 was used as the decision rule.

As three groups of students (with different backgrounds) evaluated the same set of heuristics (Nielsen’s), the Kruskal-Wallis test was performed to check the hypotheses:

  • H0: there are no significant differences between the perceptions of the three groups of students,

  • H1: there are significant differences between the perceptions of the three groups of students.

Mann-Whitney U tests were performed to check the hypotheses:

  • H0: there are no significant differences between the perceptions of two groups of students,

  • H1: there are significant differences between the perceptions of two groups of students.

Spearman ρ tests were performed to check the hypotheses:

  • H0: ρ = 0, two dimensions/questions are independent,

  • H1: ρ ≠ 0, two dimensions/questions are dependent.
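The three tests above can be sketched with SciPy on hypothetical Likert ratings; the group labels, sample sizes and scores below are illustrative only, not the study’s data:

```python
from scipy import stats

ALPHA = 0.05  # decision rule used in the paper: p-value <= 0.05

# Hypothetical 1-5 Likert ratings of one dimension (say D1 - Utility)
# from three illustrative groups of evaluators.
group_a = [4, 5, 4, 3, 5, 4, 4, 5]  # e.g. graduate students, site X
group_b = [4, 4, 3, 4, 5, 4, 3, 4]  # e.g. undergraduate students, site X
group_c = [3, 2, 3, 4, 3, 2, 3, 3]  # e.g. undergraduate students, site Y

# Kruskal-Wallis: do the three groups' perceptions differ significantly?
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.4f},",
      "reject H0" if p_kw <= ALPHA else "retain H0")

# Mann-Whitney U: pairwise comparison between two groups.
u_stat, p_mw = stats.mannwhitneyu(group_a, group_c, alternative="two-sided")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_mw:.4f}")

# Spearman rho: are ratings on two dimensions (e.g. D1 and D2) correlated?
d1 = [4, 5, 3, 4, 5, 2, 4, 3]  # hypothetical D1 (Utility) ratings
d2 = [4, 4, 3, 5, 5, 2, 4, 2]  # hypothetical D2 (Clarity) ratings
rho, p_sp = stats.spearmanr(d1, d2)
print(f"Spearman: rho={rho:.2f}, p={p_sp:.4f}")
```

Nonparametric tests are appropriate here because they compare ranks rather than means, so they make no normality assumption about the ordinal Likert data.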

4 Results and Discussion

The Atrapalo.com experiments were presented in two previous papers [9, 10]. In summary, the experimental results show significant differences between the perceptions of Spanish and Chilean students, in several dimensions and questions (as presented in Table 1).

Table 1. Mann-Whitney U test for Spanish and Chilean students. Case study: Atrapalo.

In contrast, as described in our previous papers [9, 10], there are no significant differences between the two groups of Spanish students (participants in the 2016 and 2017 experiments) in any of the dimensions and questions. The perception of Chilean undergraduate and graduate students is also similar; there are significant differences between the undergraduate and graduate groups only on question Q2 (intention of future use). It seems that the level of studies (graduate/undergraduate) does not influence students’ opinion, at least in our experiment. So, there are significant differences between the Spanish and Chilean groups, but not really within each group.

We also noticed that Chilean students have a better opinion than their Spanish counterparts, on all dimensions and questions (Table 2). It is especially notable that even though the Chilean students have a better perception of the heuristics’ utility, clarity, and ease of use, they still feel the need for additional evaluation criteria (checklist).

Table 2. Average scores for dimensions and questions. Case Study: Atrapalo.

We did not have evidence to suspect that the differences between Spanish and Chilean students are due to their background or culture-related aspects. Based on some of the Spanish students’ comments, we identified the methodology used when introducing Nielsen’s heuristics as a possible cause. In the case of Chilean students, each heuristic is first explained through examples, and then students have to identify usability problems related to each heuristic in several case studies. The problems they identify are debated in the classroom.

As we could not repeat the experiment in Spain using the same methodology as in Chile, we decided to repeat it in Chile in 2018, in three courses, using two other online travel agencies as case studies: TripAdvisor and Expedia. So we conducted new experiments with three groups of students:

  • A first group of 22 Chilean graduate students evaluated TripAdvisor.com;

  • A second group of 27 Chilean undergraduate students also evaluated TripAdvisor.com;

  • Finally, a third group of 13 Chilean undergraduate students evaluated Expedia.com.

All three groups used Nielsen’s usability heuristics. The way we introduced Nielsen’s heuristics and performed the experiments was identical to the experiments previously made in Chile using Atrapalo.com as the case study.

The Kruskal-Wallis test indicates no significant differences between the three groups of students concerning dimensions D1, D2, D3 and D4, even though their background (undergraduate/graduate level) and/or the case study differ (Table 3). Significant differences occur only in the overall perception of the heuristic evaluation method (Q1), the intention of future use (Q2), and the completeness of Nielsen’s set of heuristics (Q3).

Table 3. Kruskal-Wallis test for three groups of Chilean students, 2018.

We then applied the Mann-Whitney U test for each pair of groups (Table 4). Results show very few significant differences:

Table 4. Mann-Whitney U test for pairs of groups of Chilean students (p-values), 2018.

  • One between undergraduate and graduate students that evaluated TripAdvisor, concerning the heuristic evaluation easiness (Q1);

  • Two between undergraduate students that evaluated Expedia versus the ones that evaluated TripAdvisor, concerning Nielsen’s heuristics ease of use (D3), and the intention of future use of Nielsen’s heuristics when evaluating online travel agencies (Q2);

  • Two between undergraduate students that evaluated Expedia versus graduate students that evaluated TripAdvisor, concerning Nielsen’s heuristics ease of use (D3) and Nielsen’s heuristics completeness (Q3).

Table 5 presents the average scores for dimensions and questions for the three groups of Chilean students that participated in the 2018 experiment. It also includes the results of the 2017 group of students. As the opinions of all groups of Chilean students are similar, it also shows the average scores for all Chilean students and, for comparison purposes, the average scores for Spanish students.

Table 5. Average scores for dimensions and questions.

The four groups of Chilean students have a better perception than their Spanish counterparts in all dimensions. They perceive Nielsen’s heuristics as more useful (D1), clearer (D2) and easier to use (D3). But they also feel a higher necessity for additional evaluation criteria (checklist, D4). They perceive the heuristic evaluation as easier to perform, compared to the Spanish students, except for the group of undergraduate students that evaluated TripAdvisor. Chilean students also express a higher intention of future use of Nielsen’s heuristics (with one exception, the undergraduate students that evaluated Expedia). Concerning the completeness of Nielsen’s heuristics when evaluating online travel agencies, Chilean students have divided opinions; two groups have a better perception than the Spanish students, but the other two groups have a less favorable perception. However, when comparing the opinion of all 112 Chilean students with the opinion of the 31 Spanish students, Chilean students have a better perception in all dimensions and questions. So, the new results are consistent with previous findings [9, 10].

Table 6 shows the correlations between dimensions/questions when considering the three groups of Chilean students that participated in the 2018 experiment.

Table 6. Spearman ρ test for all Chilean students (2018).

Few correlations occur when analyzing each group of Chilean students that participated in the 2018 experiments (Tables 7, 8, and 9).

Table 7. Spearman ρ test for graduate Chilean students. Case study: TripAdvisor.
Table 8. Spearman ρ test for undergraduate Chilean students. Case study: TripAdvisor.
Table 9. Spearman ρ test for undergraduate Chilean students. Case study: Expedia.

As in our previous studies, few correlations occur in relatively small groups of students. When the three groups of students are considered together, more correlations occur, and most of them are also consistent with our previous studies. The D1 – D2 correlation is particularly frequent: when a heuristic’s specification is perceived as clear, the heuristic is also perceived as useful.

Open questions OQ1 and OQ2 evaluate some qualitative aspects of the evaluators’ perception. What the three groups of students pointed out is similar to what students of previous generations expressed [9].

According to the students’ comments, the use of Nielsen’s heuristics seems to require positioning themselves in a new paradigm of thinking, in order to perceive and evaluate a website from an evaluation perspective to which they are not accustomed. The comprehension of each heuristic, as well as its identification, adaptation and mode of application to different products, are aspects that the evaluators identify as difficult in their work.

Based on this, they highlight the importance of having elements that help them familiarize themselves both with the artifacts they are using (Nielsen’s heuristics) and with the services offered by the evaluated products (the TripAdvisor and Expedia websites in this case). In this sense, the evaluators emphasize the need to have technical reports that would provide them with examples of heuristic evaluations previously carried out (either by them or by others). On the other hand, the evaluators highlight that the websites they evaluated should provide strategies that facilitate their understanding by the people who use them (for example, through tutorials), as well as a good organization, distribution and precision of the information. They consider that novice users also face the adjustment of their way of thinking and operating to what the websites they use offer and allow.

It is also interesting that the evaluators point out that although their work consists in evaluating products, they experience difficulties in taking a critical look, especially in detecting problems that are not major, evident, or common. It seems that the evaluators are guided mostly by functionality and effectiveness criteria (based on the achievement of final results); they expect problems to be detected while the ongoing actions are carried out. However, following that direction, aspects of the subjective and personal experience of real users may be underestimated and unattended. In this sense, it seems that the evaluators have difficulties identifying problems until these become complications for themselves, according to their own way of using the product and their own experiences. The evaluators thus note difficulties in putting themselves successfully in the place of other users, especially novices. Complications also seem to arise because each evaluator has to understand the other evaluators’ opinions. The evaluators emphasize that it is difficult for them to coordinate their opinions and perceptions regarding the evaluations carried out, in order to reach a consensus with the rest of the evaluation team.

5 Conclusions

Heuristic evaluation is probably the most popular usability inspection method, but forming evaluators is not an easy task. Heuristic evaluation results depend highly on both the heuristics’ quality and the evaluators’ experience. Evaluators use specific artifacts: the set of usability/UX heuristics and the evaluation protocol. The protocol seems to be less challenging, but properly understanding and correctly applying the heuristics in practice is much more demanding, especially for novice evaluators. The heuristics’ “usability” may be assessed, based on a heuristics quality scale. The evaluators’ experience may also be assessed.

We systematically conduct studies on (novice) evaluators’ perception of generic and specific usability heuristics, based on a questionnaire that we developed. The questionnaire allows evaluating each heuristic individually (Utility, Clarity, Ease of use, Necessity of additional checklist), but also the set of heuristics as a whole (Easiness, Intention, Completeness). It also allows evaluators to express their perception through comments.

In a previous comparative study, we noticed significant differences between the perceptions of Chilean and Spanish Computer Science students when evaluating the same online travel agency (Atrapalo) based on Nielsen’s heuristics. The perception of Chilean students with different backgrounds was similar. The perception of two generations of Spanish students was also similar.

As we did not have evidence to suspect cultural or background-related issues as a possible cause, we think the reason could be the methodology of introducing Nielsen’s heuristics when teaching the heuristic evaluation method. We checked our assumption on two new case studies (TripAdvisor and Expedia), with three new groups of Chilean students. The new results are consistent with our previous findings. Chilean students’ perception was systematically better than Spanish students’ perception.

As future work, we would like to check (if possible) whether the methodology that we use with Chilean students would lead to similar results when applied to Spanish students.