1 Introduction and Related Work

A satisfying user experience is one of the main objectives for any mobile application. However, assessing user experience is no trivial task: it depends strongly on the subjective opinions, prior experiences and preferences of individual users [6]. To this end, several questionnaires for rating the user experience or usability of an application have been created over the years, such as AttrakDiff [5], SUS [2] or the UEQ [10]. Traditionally, real users are asked to rate their experience with a system using such questionnaires, typically in laboratory test situations. This can be a tedious and expensive procedure. Thus, in recent years several approaches have been developed to formally describe and predict user behavior and ratings for a given system without having to ask real users [3, 9, 11]. Modeling user behavior reliably is, however, not an easy task either: there is always the challenge of capturing cultural, gender or even dexterity differences in the users' ratings.

In recent years, touch interface devices such as smartphones and tablets have become widely distributed in all parts of the world. Most recently, it has been shown that it is possible to predict the emotional state of users by analyzing their touch interaction [12]. These findings indicate that touch interaction seems to be a much more intuitive interaction modality than the previously examined desktop point-and-click interaction [7].

In this work we propose a combination of both approaches above: using the ratings from the AttrakDiff Mini questionnaire [4] and real users' touch interaction traces to train a model that can predict user experience ratings. The AttrakDiff Mini questionnaire yields three quality dimensions: pragmatic quality (PQ), hedonic quality (HQ) and attractiveness (ATT). Pragmatic quality is connected to the usability of the application and gives insight into whether the application's interface is suited to fulfilling a task. Hedonic quality describes whether the user felt that his or her individual needs (identification or stimulation) were met by the application. Attractiveness is a synthesis of PQ and HQ. We expect that at least the pragmatic quality aspects of user experience should be predictable from touch interactions, as touches are a direct consequence of how easy an application is to use.
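The mapping from questionnaire items to the three quality dimensions can be sketched as per-scale averaging. This is a minimal sketch: the item-to-scale assignment below is hypothetical and not the official AttrakDiff Mini scoring key.

```python
def score_attrakdiff_mini(item_ratings):
    """Average 7-point item ratings into PQ, HQ and ATT scores.

    `item_ratings` maps item index (0-9) to a rating in 1..7. The
    item-to-scale assignment below is illustrative only, not the
    official AttrakDiff Mini key.
    """
    scales = {
        "PQ": [0, 1, 2, 3],   # pragmatic quality items (assumed)
        "HQ": [4, 5, 6, 7],   # hedonic quality items (assumed)
        "ATT": [8, 9],        # attractiveness items (assumed)
    }
    return {
        name: sum(item_ratings[i] for i in items) / len(items)
        for name, items in scales.items()
    }

# Example: a participant rating every item 5 yields 5.0 on each scale.
scores = score_attrakdiff_mini({i: 5 for i in range(10)})
```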

2 Methods

As a first step towards predicting user experience ratings from touch interactions, a study with 31 test participants was conducted: 16 females (15 right-handed, 1 left-handed) and 15 males (all right-handed). The test participants were asked to use a mobile application and rate it. To this end, we used the AttrakDiff Mini questionnaire [4]. It has only 10 items and as such is more likely to yield high quality on repeated subjective measures.

The mobile application used in this study [1] provides a tracking framework that allows logging interactions with the application. Two games were presented to the test participants through the mobile application: Spell and Quiz. We chose games in order to get meaningful contributions from functional as well as non-functional aspects of user experience. In Spell, a word of varying length needs to be spelled: the letters are spread over the screen and the user has to drag them to target tiles in the correct order. In Quiz, questions are asked and four possible answers are provided; the user chooses between them by tapping on one of them.

These two games provide solely single-touch interactions. In Spell, mainly swipe interactions are performed while spelling a word by dragging and dropping the letters; this will be referred to as drag and drop interaction. In Quiz, mainly taps are performed while answering questions by tapping on the answers; this will be referred to as point and tap interaction. To cover a variety of user experiences and corresponding ratings, three versions of the app were presented: Normal (the app was presented as it was originally intended to work and look), TinyIcons (the app's look and controls were degraded by providing control elements that are too small) and Freezing (the app's response time was slowed down with delayed interaction feedback).

This resulted in six test conditions for the study: (1) Quiz-Normal, (2) Quiz-Freezing, (3) Quiz-TinyIcons, (4) Spell-Normal, (5) Spell-Freezing and (6) Spell-TinyIcons. These six conditions were presented to each test participant in two sets, each containing only one game. In each set, all three degradation versions were presented in random order and rated twice, resulting in six sessions per set. Each session lasted two minutes. Between the sets a five-minute break was granted. The starting game was alternated between test participants. With 31 participants completing 12 sessions each, this resulted in 372 ratings per quality dimension in total.

To be able to extract features, the interaction intervals within a session were subdivided into tasks and interactions. A task consists of choosing the correct answer or dragging the correct letter to the target tile. One task can be composed of several interactions such as choosing any answer or moving any letter anywhere but the correct target tile.

The features chosen to be extracted from the tracking can be grouped into touch counts, touch measures and performance-related features. The touch count features count the number of times the screen was touched and a control element was missed or hit during an interaction, a task or a session. The touch measures are calculated from the positions and timestamps of the tracking points, such as swipe length and speed, the duration of a single touch, interaction or task, the duration between two touches or the distance between a touch and the center of a control element. The performance-related features are defined by the task logic: the number of correct answers and the number of completed tasks per session. The latter two feature types were aggregated for each session by calculating descriptive statistics on them. This resulted in 107 features in total for the modeling of AttrakDiff's quality dimensions.
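The per-session aggregation can be sketched as follows. This is a minimal Python sketch: the choice of statistics, the feature names and the example values are illustrative and are not the actual 107 features used in the study.

```python
import statistics

def aggregate_session_features(values):
    """Descriptive statistics used to aggregate a per-touch or per-task
    measure (e.g. swipe length) over one session. The set of statistics
    shown here is an assumption, not the study's exact feature list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

# Hypothetical swipe lengths (in pixels) recorded during one session.
swipe_lengths_px = [120.0, 140.0, 130.0, 150.0]
features = aggregate_session_features(swipe_lengths_px)
```

Applying such a set of statistics to every touch count and touch measure, plus the two performance features, yields one fixed-length feature vector per session.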

A closer look at the data revealed that for the condition Freezing some touches were lost for the first 20 test participants. Thus, in the analysis, we will concentrate on predicting the quality dimensions on the data containing only the conditions Normal and TinyIcons.

As a first step, appropriate features for the linear regression model need to be selected. This is done with forward selection [8], which starts with one feature and successively adds the feature that yields the lowest residual sum of squares (RSS). To measure the goodness of linear models using different numbers of features, the adjusted \(r^2\) and the Bayesian information criterion (BIC) were used. The adjusted \(r^2\) relates to the fraction of the data's variance that is explained by the linear model (\(r^2\)), but takes the number of features selected for the model into account; it penalizes overfitting through the use of too many features. The BIC is another measure that penalizes models with too many features. In model selection, a model with a high adjusted \(r^2\) and a low BIC is preferred. Additional information on the expected prediction performance of a linear model is given by the residual standard error (RSE), which indicates how much the real outcomes deviate on average from the regression line of the model. The quality dimension ratings range from 1 to 7, so an RSE of approximately 1 or below should be sufficient to estimate a user rating.
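The selection procedure can be sketched as follows, assuming ordinary least squares fits and the common Gaussian form of the BIC; the data below are synthetic, not the study's features.

```python
import numpy as np

def forward_selection(X, y, max_features):
    """Greedy forward selection: at each step, add the feature whose
    inclusion yields the lowest residual sum of squares (RSS).

    Each model size is scored with the BIC in its common Gaussian form,
    n*log(RSS/n) + k*log(n); this particular variant is an assumption,
    the text does not state which form was used."""
    n, d = X.shape
    selected, results = [], []
    for _ in range(max_features):
        best_j, best_rss = None, None
        for j in range(d):
            if j in selected:
                continue
            # Least squares fit with intercept on the candidate feature set.
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ beta) ** 2))
            if best_rss is None or rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        k = len(selected) + 1  # +1 for the intercept
        bic = n * np.log(best_rss / n) + k * np.log(n)
        results.append((list(selected), best_rss, bic))
    return results

# Synthetic data: only feature 0 carries signal, so it is picked first.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=60)
steps = forward_selection(X, y, max_features=3)
```

In practice, one would run this over all feature counts and pick the size that minimizes the BIC, as done in the analysis below.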

After analyzing the whole data set, the prediction performance of linear models containing different numbers of features was tested. To this end, the data set was divided into a training and a test set at a ratio of two thirds to one third. The feature selection was done on the training set and the best trained model was evaluated using the mean squared error (MSE) on the test set. The MSE is the average squared deviation of the outcomes predicted by the model from the real outcomes in the test set. Following the reasoning for the RSE, a value of approximately 1 indicates good performance for the MSE as well.
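The evaluation procedure can be sketched as follows. The two-thirds split ratio follows the text; the ordinary-least-squares fit, the random split and the synthetic data are illustrative assumptions.

```python
import numpy as np

def train_test_mse(X, y, train_frac=2 / 3, seed=0):
    """Fit a linear model on a random two-thirds split of the data and
    return the mean squared error (MSE) on the held-out third."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    tr, te = idx[:cut], idx[cut:]
    # Ordinary least squares fit with intercept on the training split.
    A_tr = np.column_stack([np.ones(len(tr)), X[tr]])
    beta, *_ = np.linalg.lstsq(A_tr, y[tr], rcond=None)
    # MSE of the predictions on the held-out test split.
    A_te = np.column_stack([np.ones(len(te)), X[te]])
    return float(np.mean((y[te] - A_te @ beta) ** 2))

# Synthetic data with small noise: the held-out MSE stays well below 1,
# the threshold the text takes as sufficient for estimating a rating.
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=90)
mse = train_test_mse(X, y)
```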

3 Results

The top of Fig. 1 shows the ratings for the quality dimensions pragmatic quality (PQ), hedonic quality (HQ) and attractiveness (ATT) under the conditions Normal and TinyIcons, for both games combined and for Spell and Quiz separately. The normal versions of the apps were rated better and with less variation than the diminished versions in all three dimensions. The small icons lowered the ratings of all three quality dimensions in Spell (drag and drop interaction). For Quiz (point and tap interaction), only the ratings for attractiveness differ notably.

Fig. 1. Top: distribution of the ratings for pragmatic quality (PQ), hedonic quality (HQ) and attractiveness (ATT) for the conditions Normal and TinyIcons. The first row shows the ratings for both games, the second for Spell and the last for Quiz. Bottom: exemplary features that were extracted from the touch interaction tracking.

The extracted touch interaction features differ as well between the two app versions. This is shown at the bottom of Fig. 1 for four exemplary features. The number of interactions performed in one session is higher for the undisturbed versions, while the number of missed control elements is higher for the versions with too small control elements. The average number of control elements touched per interaction per session does not differ very much. The touches per interaction, however, are higher for the diminished condition.

Fig. 2. First row: the adjusted \(r^2\) for linear regressions against the number of selected features for pragmatic quality (PQ), hedonic quality (HQ) and attractiveness (ATT). The first column shows results for both games combined, the second for Spell and the last for Quiz. Second row: the Bayesian information criterion (BIC) for the three quality dimensions. The black dots indicate the optimal number of features selected and the resulting RSE. The empty circles in the upper plots indicate the corresponding adjusted \(r^2\) for the number of features selected by minimizing the BIC.

The findings presented in Fig. 1 indeed suggest that it should be possible to predict the quality ratings from the touch interactions. Figure 2 shows the results of the feature selection for the linear regression models.

In the upper row, the adjusted \(r^2\) values for PQ, HQ and ATT are plotted against the number of features used for the linear model. The black dots indicate the number of features (p) that leads to the maximum adjusted \(r^2\), together with the residual standard error (RSE) of the corresponding model. The number of features selected by maximizing the adjusted \(r^2\) is very high (between 31 and 70 features on 124 to 248 data points), which could lead to overfitting of the data. Thus, to determine a smaller number of features, the Bayesian information criterion (BIC) was minimized. As the lower row of Fig. 2 shows, the minima of the BIC indicate far fewer features for the models while still retaining an acceptable RSE between 0.41 and 1.12. The empty circles in the upper plots indicate the corresponding adjusted \(r^2\) for the number of features selected by minimizing the BIC. The adjusted \(r^2\) is still quite high at this point: the curves are flattening, so the further gain in adjusted \(r^2\) is small compared to the danger of overfitting by adding more and more features.

Fig. 3. Mean squared error (MSE) for test outcomes predicted by a model trained on training data. The first column shows results for both games combined, the second for Spell and the last for Quiz.

Overall, it can be seen that for Spell and for both games combined, the pragmatic quality can be modeled best; for Quiz, the attractiveness is modeled best. To get an impression of whether the prediction performance generalizes, the data set was split into a training and a test set. Figure 3 shows the mean squared errors (MSE) for the prediction of test outcomes from models trained with different numbers of features; the smallest MSE indicates the best model. This approach, too, suggests a small set of features for prediction for all three quality dimensions. The black dots indicate the best model and the fraction of variance of the test data that is explained by the trained model, \(r^2\). For HQ and ATT this value is very low (about 0.4 and lower) for all apps; for PQ, however, it is between 0.51 and 0.64.

4 Discussion

The ratings for Quiz (Fig. 1, top) indicate that degrading an interface that is mainly used for pointing and tapping with control elements that are much too small reduces the attractiveness, but does not influence the mean values of the pragmatic or hedonic quality ratings much. This could lead to the conclusion that the small control elements do not reduce the pragmatic quality of a point and tap game, but only its attractiveness, e.g. through reduced readability. For a drag and drop game (e.g., Spell) this disruption reduces the pragmatic quality, as it is much harder to perform a task.

To verify this, the results from the Freezing condition should be examined, at least for the last 10 test participants for whom the recording worked.

The linear models trained on the Quiz data yield the highest adjusted \(r^2\) values for attractiveness (Fig. 2), while on the other data sets the pragmatic quality is modeled best. This, too, could be explained by the differences in the ratings from Fig. 1. Looking at the prediction performance on a new data set (Fig. 3), however, only the variance in the pragmatic quality ratings can be predicted to a satisfying degree, with \(r^2 = 0.64\) and an MSE below 1. This tendency can also be seen for Spell and for both games combined. Whether this is due to the model simply guessing between the highest peaks of the rating distributions (Fig. 1), or whether the touch features used in this work can really be used to predict pragmatic quality, needs to be examined further. These features, however, do not seem suited to modeling hedonic quality and, thus, attractiveness. It will be interesting to define and extract further touch features from the recorded interactions that might lead to a better model of hedonic quality and attractiveness. The linear regressions for both games combined seem to “follow” the outcomes of the linear regressions for Spell. This could mean that the features that predict best for Spell also predict well for Quiz.

The features actually selected for the different models presented above varied considerably. It was not possible to find any features that seemed specifically connected to predicting PQ, HQ or ATT. However, there were tendencies for certain features to be more connected to a certain quality dimension. The number of interactions per session, known from Fig. 1 (bottom, first column), which counts how often the user had to interact with the system until a task was completed, seems more connected to pragmatic quality. The number of missed control elements (bottom, column 2) seems slightly more connected to attractiveness. The average number of control elements touched per task (bottom, column 3) appears more relevant to hedonic quality and attractiveness. Finally, the average number of interactions per session (bottom, column 4) seems to be connected to all three quality dimensions.

5 Conclusion and Outlook

This work demonstrated that it might be possible to assess the pragmatic quality of a user interface degraded by too small control elements from the recorded touch behavior of the users, yielding an \(r^2\) of up to 0.64. It also showed that the user experience of different interaction methods is more (drag and drop) or less (point and tap) sensitive to small control elements.

The results presented in this work indicate an interesting direction for future research. They strengthen the hypothesis that it is possible to predict subjective opinions from users' touch interaction. It remains, however, subject to future research to determine which kind of disruption of a user interface influences which kind of interaction method. It was found that for a point and tap interaction (Quiz), the small control elements influence the attractiveness dimension, while for a drag and drop interaction (Spell) they influence all quality dimensions. Another future research question is which features can reflect these rating differences best in the predictions. As shown in Fig. 3, the features selected for this work perform well for predicting the pragmatic quality for both interaction methods, but poorly for the other dimensions. The range of the ratings, however, was limited and their distribution peaked at similar values. Further research should strive to collect more uniformly distributed user ratings.

To enhance the prediction performance for the quality ratings, other prediction methods, such as non-linear regression, neural networks or support vector regression, could be applied.

Still, it should be addressed which features are good predictors for hedonic quality and attractiveness.