1 Introduction

User experience design often involves analyzing subjective preferences of various stakeholders, especially end users and customers. By analyzing preferences of different stakeholders, we can not only derive concrete requirements for good usability, but also set appropriate priorities for design activities. This is often necessary in practice because there may not be enough time and resources to comply with all potential interests, for example, to carry out all design proposals or to fulfill all requirements perfectly. More importantly, having a clear view of preferences from different perspectives can help us discover and deal with conflicts of interest. Even if the stakeholders involved in the design processes agree on what should be realized, they may disagree on which items are more important. Therefore, the information on preferences and their structure serves as input for strategic decision making.

To ensure a good quality of the input for decision making, a high degree of accuracy in scaling, that is, an interval or ratio scale, is advantageous. First, it provides richer information about the magnitude of latent preferences (e.g. how much one option is preferred to another). Second, it enables precise comparisons of different stakeholders’ perspectives on a quantitative level (e.g. correlations can be calculated as a similarity measure). Finally, managerial decision making needs accurate quantitative information to optimize the resource allocation in design processes (e.g. how much of the budget should be invested in realizing the top three priorities).

Accordingly, user experience design demands accurate and efficient techniques of gathering and analyzing subjective preferences. However, in practice, measuring preferences is a vital yet challenging activity due to time and resource constraints. Among the approaches commonly used in practice, those which provide more accurate and reliable information (e.g. paired comparison) are costly, whereas those which require less time and resources (e.g. direct ranking) yield less precise information (see next sections for a thorough discussion).

To solve this problem, we combine the response collection technique of stepwise ranking and the data analysis technique for paired comparison. That is, we collect preference data using the ranking-by-elimination procedure [1], derive pairwise judgment data from the ranking data and then estimate ratio-scaled values of the latent preference from the derived data based on the Bradley-Terry-Luce model [2, 3]. In this way, our method inherits the advantages of both techniques: it provides ratio-scaled results while remaining economical. The analytic procedure is capable of inferring ratio-scaled estimates of the latent preference from ordinal judgment data because the Bradley-Terry-Luce model makes specific assumptions (see Sect. 2.1). Transforming the ranking data into pairwise judgments additionally introduces a consistency assumption. Therefore, the accuracy of the estimates depends on the extent to which these assumptions hold in the specific application context. Practitioners applying the method can gain insight into this issue with the aid of the model’s goodness of fit and the correlation with genuine paired comparison, respectively.

In the following, we will outline this method and exemplify how to apply it in practice by presenting an empirical study as a specific example. In the study, the proposed method was applied to the prioritization of quality requirements on UI texts and the results were compared with those of genuine pairwise judgment. We conducted the study to demonstrate the applicability as well as to test the validity and efficiency of the method empirically. Based on the results, we will discuss the possible reasons for the different preferences of various stakeholders as well as the prospects and issues of applying this method in broader contexts.

2 Background

2.1 How to Infer Ratio-Scaled Estimates of Preferences from Complete Paired Comparison Data

In paired comparison, an individual expresses his or her preference between two alternatives at a time. Compared to rating and other methods yielding directly interval-scaled data, paired comparison has several advantages. Psychometrically, it has higher reliability and can reveal inconsistent judgments [4]. It measures more accurately because it is able to identify minimal differences in preferences [5] and it also avoids some biases and problems of direct rating [5, 6], such as unstable reference point [6]. In addition, paired comparison creates less cognitive load than ranking or other methods do because it is easier for participants to make dichotomous judgments between pairs than to assign numbers to each alternative. References [7, 8] provide practical examples for successful application of the paired comparison method in the context of UI design.

Although a single comparison collects only ordinal information on the preference for one pair of alternatives (i.e. not including the information “how much A is better than B”), the data of all possible pairwise combinations (complete paired comparison) can result in accurate and reliable ratio-scaled estimates for preferences if based on proper models. These models assume that the alternatives are ordered on a latent preference scale and there is a probabilistic relationship between the positions of the alternatives on the preference scale (i.e. the scale values) and the observable choices in paired comparison. Thus, by analyzing the observed pattern of pairwise judgments, the unobservable magnitudes of the preferences of all alternatives can be inferred. One of these models, the Bradley-Terry-Luce model [2, 3], assumes that for each specific pair A and B, the probability of choosing A equals the ratio of the latent value of A to the sum of the values of A and B. Furthermore, it assumes that the pairwise judgments are mutually independent. Therefore, the probability of observing a specific choice pattern is proportional to the product of the probabilities of all the pairwise judgments (for a detailed discussion see e.g. [4, 9]; for mathematical proof, see [2, 3]). The Bradley-Terry-Luce model belongs to the family of generalized linear models. Hence, the scale values as model parameters can be estimated by means of conventional estimation methods for logit or log-linear models [10]. The estimates are to be interpreted as the positions of the options on the assumed one-dimensional preference scale. According to the model, they are ratio-scaled and hence justify statements such as “A is twice as [good, beautiful, usable, …] as B”. Because the model is essentially probabilistic, confidence intervals of the estimates are computed to take account of sampling errors.
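The choice rule and the estimation idea described above can be illustrated with a minimal Python sketch. It is not the estimation routine of any particular package: the function names, the simple iterative update used here in place of a logit or log-linear fit, and the example counts are all assumptions made for illustration. The sketch also assumes that every option is chosen at least once.

```python
def btl_probability(value_a, value_b):
    """Probability that A is chosen over B under the Bradley-Terry-Luce model."""
    return value_a / (value_a + value_b)


def fit_btl(wins, iterations=500):
    """Estimate BTL scale values from a matrix of choice counts.

    wins[i][j] is the number of times option i was chosen over option j.
    Uses a simple fixed-point (minorization-maximization) update, which is
    an illustrative choice and not the method of a specific package.
    Assumes every option was chosen at least once.
    """
    n = len(wins)
    values = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denominator = sum(
                (wins[i][j] + wins[j][i]) / (values[i] + values[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denominator)
        scale = sum(updated)
        values = [v / scale for v in updated]  # normalize so values sum to 1
    return values


# Hypothetical counts: option 0 was chosen over option 1 in 6 of 9 comparisons.
example_values = fit_btl([[0, 6], [3, 0]])
print(example_values)  # approximately [0.667, 0.333]
```

In this hypothetical two-option case the estimated values stand in a 2:1 ratio, which is the kind of statement ("A is twice as preferred as B") that a ratio scale permits.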

Besides the property of providing ratio-scaled values, another advantage of model-based analysis is the testability of models. The goodness of fit of the model describes how well it fits the observed data. For example, the measure G² represents the difference between the values predicted by the estimated preferences and the observed data. The goodness-of-fit test then computes the probability of observing this difference (or a larger one) given that the model describes the empirical data adequately. If it is very unlikely to observe the difference, then the model is not appropriate to describe the data and the estimated scale values (model parameters) are not valid in this case. Thus, unlike direct rating, which relies on implicit and untested assumptions [6], model-based analysis of paired comparison data provides quantitative information about how good the estimated scale values are in a given application context.
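As a rough illustration of the fit measure, the following Python sketch computes a likelihood-ratio statistic of the usual form, 2 Σ O ln(O/E), from observed choice counts and a given set of scale values. The function names and the example data are hypothetical, and the degrees-of-freedom count assumes the common convention of one cell per pair minus the n − 1 free scale values.

```python
import math


def g_squared(wins, values):
    """Likelihood-ratio statistic comparing observed and model-expected counts.

    wins[i][j] is the number of times option i was chosen over option j;
    values are BTL scale values (any positive scaling).
    """
    n = len(wins)
    statistic = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            observed = wins[i][j]
            trials = wins[i][j] + wins[j][i]
            expected = trials * values[i] / (values[i] + values[j])
            if observed > 0:  # empty cells contribute nothing to the sum
                statistic += 2.0 * observed * math.log(observed / expected)
    return statistic


def btl_degrees_of_freedom(n_options):
    """Number of pairs minus the number of free scale values (assumed convention)."""
    pairs = n_options * (n_options - 1) // 2
    return pairs - (n_options - 1)


# Hypothetical data: a perfect fit gives a statistic of zero.
print(g_squared([[0, 6], [3, 0]], [2 / 3, 1 / 3]))
print(btl_degrees_of_freedom(15))  # 105 pairs - 14 free values = 91
```

The p-value would then be obtained from a chi-square distribution with these degrees of freedom; that step is omitted here to keep the sketch free of external dependencies.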

The largest drawback of paired comparison is its inefficiency in terms of information obtained per unit of time [4]. If the number of alternatives is large, it takes much time to compare all alternatives in pairs (n(n − 1)/2 comparisons for n alternatives). In addition, it can also cause fatigue and boredom of participants.

2.2 Ranking-By-Elimination as a Preference Collection Method

Ranking as a data elicitation method has the advantage of requiring less time and resources while remaining straightforward, which makes it favorable for practical and commercial applications. However, analyzed by means of common non-parametric methods, ranking data yield merely ordinal-scaled estimates, because a single rank order does not include the information “how much better” either. Moreover, rank order data have the problem that individuals usually pay more attention to the top few choices rather than carefully ranking all alternatives, resulting in additional noise in the lower rankings (e.g. [11]). This is related to the increased cognitive load of ranking. With an increasing number of alternatives, participants usually experience difficulties in putting all the alternatives in a rank order. Some may spend a long time reconsidering and altering their rankings [1].

To reduce the cognitive load and to ensure participants’ attention for all alternatives, [1] proposed a new ranking procedure called ranking-by-elimination. In this procedure, participants identify the least preferred alternative at each step, and this option is then irrevocably eliminated from the list of alternatives. The stepwise elimination repeats until only one option is left. Empirical studies show that ranking by elimination is slightly faster than common ranking procedures and yields similar results [1].
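The elimination loop can be sketched in a few lines of Python. This is only an illustration of the procedure as described here: the function name, the callback standing in for the participant, and the utility numbers are hypothetical.

```python
def rank_by_elimination(options, pick_least_preferred):
    """Build a full rank order (best first) by repeatedly removing the worst option.

    pick_least_preferred is a stand-in for the participant: it receives the
    remaining options and returns the one to eliminate.
    """
    remaining = list(options)
    if not remaining:
        return []
    eliminated = []
    while len(remaining) > 1:
        worst = pick_least_preferred(remaining)
        remaining.remove(worst)  # elimination is irrevocable
        eliminated.append(worst)
    eliminated.append(remaining[0])  # the last option left is the most preferred
    return list(reversed(eliminated))


# Hypothetical participant who always eliminates the option with the lowest utility.
example_utility = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 0.5}
example_ranking = rank_by_elimination(
    list(example_utility),
    lambda remaining: min(remaining, key=example_utility.get),
)
print(example_ranking)  # ['A', 'C', 'B', 'D']
```

For n options the participant makes n − 1 elimination decisions, each over a shrinking list.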

2.3 The Combined Method

The idea of combining ranking procedures and the analysis technique for paired comparison already exists in the literature and related empirical research has also been done in various contexts (cf. [4, 5, 12]). The combined method is based on connecting ranking data to paired comparison by regarding a rank order as the result of a specific paired comparison pattern. More specifically, given a rank order of n objects, the object with the highest rank must always be preferred when compared with all other (n − 1) objects. The second highest ranked object then must be favored in the comparisons with all n − 2 objects ranked below it. In this way, a response pattern of the complete paired comparison can be uniquely derived from the given rank order. After this, preference scale values can be estimated using the Bradley-Terry-Luce model based on the derived pairwise judgments. In other words, the combined method “converts the ranking data into paired comparison data, and then work on it, as if it was paired comparison data.” ([9], p. 35) As the ranking procedure requires less time and resources, the total cost is reduced.
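The transformation from a rank order to a complete pattern of pairwise judgments can be sketched as follows. The function names and the example rankings are hypothetical; the sketch only assumes that each ranking is a list with the most preferred object first.

```python
from itertools import combinations


def ranking_to_pairs(ranking):
    """Derive (winner, loser) pairs from a rank order given best first.

    Every object is treated as preferred to all objects ranked below it.
    """
    return list(combinations(ranking, 2))


def aggregate_wins(rankings, options):
    """Sum the derived pairwise judgments of all participants into a count matrix.

    The result has wins[i][j] = number of rankings placing options[i] above options[j].
    """
    index = {option: k for k, option in enumerate(options)}
    wins = [[0] * len(options) for _ in options]
    for ranking in rankings:
        for winner, loser in ranking_to_pairs(ranking):
            wins[index[winner]][index[loser]] += 1
    return wins


print(ranking_to_pairs(["A", "B", "C"]))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# Two hypothetical participants ranking the same three objects.
print(aggregate_wins([["A", "B", "C"], ["B", "A", "C"]], ["A", "B", "C"]))
```

A ranking of n objects yields n(n − 1)/2 derived judgments, so the resulting count matrix has the same shape as one obtained from a genuine complete paired comparison.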

The price to pay is a potential information loss and the violation of one of the model assumptions. In genuine paired comparison, an individual might prefer A to B, B to C and C to A because people do not always judge consistently. However, such cases are not possible in pairwise judgments derived from ranking data because individuals are forced to make a full ranking of the objects instead of judging pair by pair. In this way, the possible patterns of pairwise judgments are limited to a small subset by the transformation. Moreover, requiring strict consistency may violate the assumption that the pairwise judgments are mutually independent. Hence, the transformation might lead to some distortions in the results.
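The difference between genuine and derived judgments can be made concrete by counting circular triads, that is, triples in which A is preferred to B, B to C and C to A. The following Python sketch is illustrative only; the function name and the example judgments are hypothetical.

```python
from itertools import combinations


def count_circular_triads(prefers, options):
    """Count triples of options whose pairwise judgments form a cycle.

    prefers is a set of (winner, loser) tuples from one participant.
    """
    count = 0
    for a, b, c in combinations(options, 3):
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (b, a) in prefers and (c, b) in prefers and (a, c) in prefers
        if forward or backward:
            count += 1
    return count


options = ["A", "B", "C"]

# Genuine paired comparison can contain an inconsistent pattern.
genuine = {("A", "B"), ("B", "C"), ("C", "A")}
print(count_circular_triads(genuine, options))  # 1

# Judgments derived from a rank order (best first) are always consistent.
derived = set(combinations(["A", "B", "C"], 2))
print(count_circular_triads(derived, options))  # 0
```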

Whether the consistency assumption is met in a given dataset is an empirical question. Although the model fit can tell us how well the model assumptions are met for the transformed data, there is no a priori measure quantifying the degree to which the transformation is appropriate. Therefore, we collected data using both ranking-by-elimination and genuine paired comparison and compared the results.

2.4 The Example of Prioritizing Quality Requirements on UI Texts

We applied both ranking-by-elimination and genuine paired comparison as data collection procedures to investigate a specific question in UI design. The application context was a project investigating quality factors of UI texts. One of the project’s goals was to scale the relative importance of various quality requirements for UI texts from the perspectives of various stakeholder groups, namely end users, decision makers and information developers. The research questions were how subjective expectations of UI text properties can be efficiently measured and whether there are meaningful discrepancies between the priorities of the main recipients and those who create UI texts.

Because the proposed method is applicable to scaling the relative preferences regarding a given set of objects, the objects must first be identified. We therefore constructed a set of UI text requirements based on a sample of existing design guidelines for UI texts by clustering and synthesizing all the relevant requirements in them. We assume that the guidelines reflect representatively the requirements identified in the earlier phase of requirement gathering. Nevertheless, we do not claim that the design guidelines used in our study have exhausted all the requirements that contribute to forming a quality judgment of UI texts.

3 Methods

3.1 Participants

In total, twelve end users, ten decision makers and eighteen information developers took part in the study. The end users and decision makers were recruited via an external agency based on a detailed screening document. The information developers were employees of a software company.

3.2 Material

The stimulus material consisted of screen mockups presenting fifteen concrete quality requirements on UI texts. The quality requirements used were: “correct spelling”, “correct capitalization”, “correct grammar”, “correct punctuation”, “no abbreviations”, “idiomatic language”, “simple language”, “clear language”, “parallel construction”, “chronological order”, “active voice and direct address”, “consistent terminology”, “not state the obvious”, “focused on goal of user” and “necessary information embedded on UI”. They were presented with the aid of concrete mockup examples. All the examples used the same UI (creation of leave requests) and each example showed a realistic case that violates the requirement to illustrate our understanding of the requirement (see Fig. 1). We assume that the more important a requirement, the more severe its violation. Thus, the ranking of the importance of a requirement can be seen as equivalent to the reverse ranking of the corresponding violation.

Fig. 1. Screen mockup for the quality requirement “simple language”

The stimuli used in the ranking were printed copies of the mockup examples. The ones used in the paired comparison were presented on a display. To reduce the total expenditure, each participant compared only eight requirements in the paired comparison procedure. Consequently, there were two sets of stimuli for genuine paired comparison (set A and set B, with the stimulus “consistent terminology” appearing in both sets). One of these sets was randomly assigned to each participant.

3.3 Procedure

First, the moderator explained each quality requirement using the mockup examples printed on paper. The participants were instructed to make their judgments based on the general understanding of each requirement instead of the specific example. After that, participants completed the paired comparison using the software tool PXLab [13]. Each requirement was combined with all seven other requirements from the same set and each pair appeared twice, with balanced left-right position. All pairs appeared in a random order. Following the paired comparison, participants ranked all fifteen requirements on a whiteboard. In the ranking-by-elimination procedure, they were asked to pick the requirement whose violation was considered least severe and remove it from the list. Then they were to identify the least severe one from the remaining list. This process repeated until only one item was left on the whiteboard. In addition, participants were instructed to think aloud during the ranking-by-elimination procedure.
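The structure of the paired comparison trials (every pair shown twice with mirrored left-right position, in random order) can be sketched as follows. This is an illustration under stated assumptions and not the way PXLab builds its trial lists; the function name, the seed parameter and the placeholder labels are hypothetical.

```python
import random
from itertools import combinations


def build_trials(items, seed=None):
    """Create a randomized trial list in which each pair appears in both orientations."""
    trials = []
    for left, right in combinations(items, 2):
        trials.append((left, right))  # first presentation
        trials.append((right, left))  # mirrored presentation
    rng = random.Random(seed)
    rng.shuffle(trials)
    return trials


# Eight placeholder requirements, as in one stimulus set.
example_trials = build_trials([f"R{k}" for k in range(1, 9)], seed=1)
print(len(example_trials))  # 28 pairs x 2 orientations = 56
```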

3.4 Statistical Analysis

The ranking data was transformed in the manner described in Sect. 2.3 and then analyzed based on the Bradley-Terry-Luce model using the R-package “eba” [14].

4 Results

The model fit of the Bradley-Terry-Luce model was good, G²(91) = 56.57, p = 1 for the entire sample, G²(91) = 61.88, p = .99 for end users, G²(91) = 47.02, p = 1 for decision makers and G²(91) = 48.05, p = 1 for information developers. This indicates that the model assumptions are likely to be appropriate for the pairwise judgments derived from the ranking data.

More importantly, our method yields very similar results to genuine paired comparison. Substantial correlations were found between the scale values based on the transformed ranking data and those based on the genuine paired comparison data. In the total sample, the correlation was r = .84 for set A and r = .98 for set B. Correlations in the stakeholder groups ranged from .47 to .96 and four out of the six correlations were above .80.
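A correlation of this kind can be computed with a few lines of Python. The sketch assumes that r denotes the Pearson product-moment correlation; the function name and the scale values in the example are invented for illustration and are not the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical scale values for the same requirements from the two procedures.
from_ranking = [0.30, 0.25, 0.20, 0.15, 0.10]
from_paired_comparison = [0.28, 0.27, 0.18, 0.17, 0.10]
print(round(pearson_r(from_ranking, from_paired_comparison), 2))
```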

The estimated scale values are listed in Table 1 in the appendix. The importance of two requirements differs significantly if their confidence intervals do not overlap. Within each group, the requirements vary strongly in their perceived importance. The overall pattern can be roughly described as three levels: high, medium and low importance. On average, the highest importance is more than double the medium importance and about four times the lowest importance.
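The non-overlap criterion can be expressed as a small helper. The function name and the interval limits below are hypothetical and serve only to show the check.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) confidence intervals share any value."""
    lower_a, upper_a = ci_a
    lower_b, upper_b = ci_b
    return lower_a <= upper_b and lower_b <= upper_a


# Hypothetical 95% confidence intervals of two requirements.
print(intervals_overlap((0.10, 0.14), (0.05, 0.08)))  # False: treated as a significant difference
print(intervals_overlap((0.10, 0.14), (0.12, 0.18)))  # True: no significant difference by this criterion
```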

Table 1. Estimated scale values and 95% confidence intervals. A higher value indicates a higher perceived importance of the respective requirement. The two values in brackets below each scale value are the limits of the 95% confidence interval, with the lower limit on the left and the upper limit on the right.

For end users, the top three were “clear language”, “consistent terminology” and “necessary information on UI”, which were about one-and-a-half times as important as “idiomatic language” and “simple language” and twice as important as “correct spelling” and “correct grammar”. The significance of other requirements was less than one-third of that of the top three.

For decision makers, the requirements descended gradually in perceived importance except “necessary information on UI”, whose superiority was clear. “Consistent terminology”, “idiomatic language”, “clear language”, “simple language” and “focus on goals of user” were of medium importance, which was at least twice that of the remaining requirements.

For information developers, the requirements fell clearly onto three levels. “Consistent terminology” was the only one on the highest level. The second level (“clear language”, “no abbreviations”, “correct grammar”, “correct spelling”, “simple language”, “idiomatic language” and “necessary information on UI”) was at most half as important as the first level. The lowest level was mostly only one-third as important as the second level.

As the radar chart (Fig. 2) visualizes, the discrepancies between groups were larger regarding the most important requirements. The largest divergence was found regarding “necessary information on UI”: for decision makers, it was the number one priority; for end users, it was the third most important requirement (the absolute preference reduced to 2/3); and for information developers, it was the eighth (reduced to merely 1/3). Moreover, end users ranked “clear language” in first place, followed by “consistent terminology”, while this order was reversed for information developers and decision makers. Some requirements of medium importance were clearly more valued by one or two groups. “Correct spelling” and “correct grammar” had higher priority for information developers and much lower priority for decision makers. However, both groups put more emphasis on “no abbreviations” than end users. Decision makers were the only group for which “focused on goals of user” reached a medium importance. For the other two groups, this requirement was clearly insignificant.

Fig. 2. Estimated scale values for the importance of the requirements. A higher value indicates a higher perceived importance of the respective requirement. In this chart, the requirements are sorted by the average scale value across groups.

Despite the differences between groups, there was a substantial consensus. Correlations between the groups indicate that end users and decision makers agree to a large extent (r = .74), to a similar degree as end users and information developers (r = .77), whereas decision makers and information developers agree much less (r = .51). This means that the priorities of end users are quite similar to those of the other two groups, but decision makers and information developers share each other’s view to a lesser extent. This result pattern also indicates that what end users and decision makers agree on is quite different from what end users and information developers agree on.

5 Discussion

The application of the proposed method is considered successful. The good model fit indicates that the model assumptions were likely to hold for the transformed ranking data. The high correlations between the estimates based on transformed ranking data and those based on the genuine paired comparison data indicate that the results of the different response collection methods were very similar. Therefore, the transformation was appropriate in this application case. The combination of the ranking-by-elimination procedure and the analysis model for paired comparison can provide valid ratio-scaled estimates with relatively low expenditure compared to a standard paired comparison method.

With the aid of the combined method, the preference structures of different stakeholders can be revealed and compared on a quantitative level. First, the findings indicate a substantial consensus among the three stakeholder groups. Specifically, the convergence between end users and decision makers was as high as that between end users and information developers and much higher than that between decision makers and information developers. Discrepancies were found above all regarding the more important requirements. End users appear to value requirements that ensure sufficient and clear information (e.g. “necessary information embedded on UI” and “clear language”) more than information developers, while information developers emphasize more formal requirements than end users (e.g. “consistent terminology”, “correct spelling” and “correct grammar”). End users may focus on concrete tasks in the first place and consider how certain requirements would affect their problem solving (e.g. “would I get stuck if…”, “what would waste my time” and “what would make me prone to mistakes”). Information developers, in spite of recognizing the importance of some requirements that are critical for users’ problem solving, put emphasis on more formal and measurable requirements. Decision makers appear to have tried to take the end users’ perspective but overestimated the significance of some requirements for end users (e.g. “necessary information embedded on UI”, “focus on goals of user” and “no abbreviations”) while underestimating that of some others (e.g. “clear language”). Moreover, they gave some formal requirements much less priority than information developers (“consistent terminology”, “correct spelling” and “correct grammar”).

As this application example illustrates, the combined method can be used to prioritize design requirements for various UI elements, given that a set of potential requirements has already been identified. Besides requirement analysis, many situations also involve subjective priorities or preferences regarding a given set of objects. Here are some examples: how important are specific goals, use cases or tasks for users; which features, configurations or design proposals do (differently profiled) users prefer. Similarly, these priorities and preferences can be measured by the combined method, even if the objects under consideration are qualitatively very different. Therefore, we argue that besides prioritization of requirements, this method is also applicable to various design activities where decisions have to be made relying on subjective preferences. The combined method is especially efficient if the number of options is large. For example, with this method, the design questions in [7, 8] (selecting design alternatives or design proposals), which were studied by means of standard paired comparison, could have been investigated more efficiently.

As also stressed in [7, 8], using ranking or paired comparisons should be considered whenever other more appropriate methods cannot be applied. If possible, a decision between design alternatives should be supported by the results of a usability test or an experimental investigation. But this might be too time-consuming or expensive. In this case, applying the proposed scaling methods is far better than deciding on the basis of gut feeling or management verdict. In cases where we primarily investigate individual opinions or personal taste, scaling is one of the best approaches to elicit what end users “really want”. This applies, for example, to subtle variations of visual design.

In practice, the largest weakness of this method is that the transformation may not be justified in every situation. Therefore, at least a small proportion of the participants should complete the genuine paired comparison in parallel to provide validation data. An open question is how to determine the proportion that yields the best cost-benefit ratio.