
1 Introduction

Many articles on user experience design and usability focus on websites, e-commerce, consumer applications, and even gaming [14]. The design for a retail site must be simple and straightforward, or sales may be lost. The design for a service-provider site must capture the user’s attention immediately, or a short-attention-span user will go elsewhere. Mobile applications, games, and social media sites can generate exceedingly large revenue when designed to engage users in a way that keeps them coming back again and again.

Enterprise software, however, is different from consumer software. Users of enterprise software rarely, if ever, get to choose which applications they will use at their workplace. Business users’ tasks, experiences, and duration of use differ from those of private users, as do their motivations, needs, and goals. The design of the user experience of complex enterprise software should therefore be approached differently from the design of other applications.

As effective designers realize, the design of any product should focus on its users, their tasks, and usage scenarios [57]. Accordingly, usability evaluations should focus on users, tasks, and scenarios, and should be conducted in contexts as similar as possible to real-world situations. Walk-up-and-use testing methodology is frequently used for consumer applications and websites; however, enterprise applications are more robust and intricate. These designs require a different kind of usability testing. Enterprise software that is used daily by skilled technicians needs to be both usable and efficient. Evaluating its usability via walk-up-and-use methods is ill-advised.

During the design phase of a multifaceted business application, we started to organize usability testing of the user interface. As we began drafting tasks for the usability test, it quickly became obvious that the tasks our end users would do with our software could not be easily evaluated with walk-up-and-use testing methodology. In order for us to properly assess the efficiency and intuitiveness of our user interface, we needed to measure both the learnability and the usability of our design concepts.

2 Related Work

One documented process for creating usable enterprise software [8] includes the creation of user interface guidelines, standards, and style guides, along with prototyping and conducting user studies. Think-aloud protocol, field observation, and user interviews are encouraged as methods that developers can employ to improve the usability of their products [9], and strategies for assessing the usability of business intelligence applications [10] focus primarily on heuristic evaluation and user surveys.

A literature review [11] of learnability factors in software applications provides insights on various techniques to improve learnability, and proposes providing help systems to aid users, but it does not explicitly describe methods for assessing the learnability of a software application and implications for design. A model for understanding the factors that contribute to learnability [12] was defined and tested with a future goal of specifying requirements that would improve the learnability of software; however, the focus is on the attributes of learnable software rather than the process for testing learnability in software. Although learnable software is desired, and is recognized as a critical component of usability [13], it is challenging to find a generally accepted methodology for assessing the learnability of an enterprise software application. One survey of research on learnability [14] details the various ways learnability has been defined and assessed, including 25 learnability metrics that have been used in usability evaluations. In addition, a method of “coaching” was used in a study and compared to think-aloud protocol [14]. None of the learnability testing in this body of related work compared task completion times of users over time. We sought to do just that.

3 Procedure

We recruited six user-surrogates whose job experience and skill sets matched those of our target users. Each participant was tested individually. Sessions lasted three and a half to four hours each. Each session was recorded, including all participant comments and all screen interactions.

Sessions began with formal instructions that were read aloud by the moderator. The instructions consisted of a high-level overview of the application, a description of the testing scenario, and details about two distinct and separate concepts that are introduced in the application. Included in the explanation of the two new concepts was a comparison between them. The goal of the instructions was to give participants as much information as possible about the user interface before they began to interact with the application. Participants were given the opportunity to ask questions.

After the instructions were read, and any questions answered, users were given approximately 15 min of unstructured time to interact with the application. Participants were told to browse through the application on their own, with no specific tasks to attempt. The moderator was not present during this time. The intention of the unstructured time was to simulate what many business users do with new software: explore it before attempting to use it. Participants were allowed to click and explore the set-up of the software as they wished, while alone in the testing room. This allowed the user to become familiar with the interaction patterns and explore parts of the software that appealed to them. After the unstructured time, participants were given tasks to complete.

Tasks were conducted in three rounds. Each round consisted of 6 primary tasks. Round 1 was positioned as a participant-led training exercise. The moderator acted as a trainer, answered questions, explained how the product worked, and assisted with problems. Participants were required to read each task aloud, and attempt to complete it. If participants experienced difficulties completing a task, asked for help, or simply did not know what to do next, the moderator assisted. The moderator answered all questions, directed the users, explained the interaction paradigm, and helped users move on to their next step. This part of our procedure contrasts sharply with usability testing in which users are given no assistance while attempting to complete a task. Participants were informed that the moderator would assist with tasks only during Round 1. The moderator acted as a trainer or colleague who was knowledgeable about the application. Participants were encouraged to ask as many questions as possible during Round 1 because the moderator would not answer questions or help with the tasks during Rounds 2 or 3.

Following Round 1, participants were offered snacks and beverages, and given a 15 min break. They were escorted outside the testing room and encouraged to talk about other subjects. Our objective was to disrupt participants’ focus on the software and distract them from what they had been doing.

Round 2 consisted of the same 6 tasks as in the training round, but performed on different objects. Our intention was to emulate what a real workday could look like for those who use our product. The primary difference between Rounds 1 and 2 was that the moderator did not assist participants in Round 2. As with the training round, participants were required to read each task aloud, and attempt to complete it. However, if a participant struggled with a task, the moderator, though present in the room, did not provide help. Participants were encouraged to try to figure out, or remember, how to complete the tasks on their own. If a user could not successfully complete a task after a lengthy attempt, the moderator redirected the user to the part of the application where he/she needed to be, but did not provide any instruction or explanation. After Round 2, participants were given another 15 min break outside of the testing room.

After their second break, and prior to Round 3, participants were given an elaborate 45 min distraction task followed by their third, and final, break. The intention of the distraction task was to introduce interference, preventing users from rehearsing the previous tasks. The objective of the distraction task was similar to that of the breaks: we wanted our participants to stop thinking about the part of the software we were testing. The distraction task used a part of the application that had no crossover with the main test series.

Round 3 included the same tasks as Rounds 1 and 2, but again with different objects. The moderator was present but, again, did not assist participants. With each round of tasks, the moderator played less of a role: in the first round the moderator served as a trainer, and in the subsequent rounds the moderator only redirected users who could not otherwise complete a task.

Table 1 presents an overview of the entire testing session. The user activities are listed in order, along with their duration.

Table 1. Session Overview

4 Metrics

4.1 Task Completion Times

Two observers timed participants while they attempted to complete the tasks. Observers started timing as the participant began to read the task and paused timing during any application errors. After all sessions were completed, one observer watched the 24 h of video recordings multiple times to collect accurate time-on-task measurements.
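In other words, with $t_{start}$ denoting the moment the participant began reading the task, $t_{end}$ the moment of completion, and $p_1, \dots, p_k$ the intervals paused for application errors (the notation here is introduced purely for illustration), the recorded time on task amounts to

$$T_{task} = t_{end} - t_{start} - \sum_{i=1}^{k} p_i.$$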

4.2 Moderator Redirects

The two observers made note of all redirection provided by the moderator in Rounds 2 and 3. These data were later verified against the video recordings.

4.3 Confidence Ratings

At the conclusion of Rounds 2 and 3, participants were asked to rate each task on how confident they were that they had completed it correctly. The 7-point Likert scale ranged from 1 (not at all confident) to 7 (extremely confident).

4.4 Open-Ended Questions

After Round 3, participants were asked to provide written responses to two open-ended questions. The goal of these questions was to solicit qualitative feedback about the experience of using the software and to capture ideas for improvement. The questions were, “Considering your entire experience today: (1) List a few things you liked about this application; and (2) List a few things about this application that could be improved.”

4.5 The SUS

The System Usability Scale (SUS) is an easy-to-apply tool that provides a usability score for a software application [15]. It has been in use for more than two decades and has been found to be both reliable and valid [16]. After participants completed all three rounds of testing, rated their confidence of success, and answered the two open-ended questions, we asked them to complete the SUS as our final measure of usability.
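For reference, the SUS comprises ten statements rated on 5-point scales. With item responses $r_1, \dots, r_{10}$, the standard scoring procedure yields a score between 0 and 100:

$$SUS = 2.5\left[\sum_{i \in \{1,3,5,7,9\}} (r_i - 1) + \sum_{i \in \{2,4,6,8,10\}} (5 - r_i)\right].$$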

5 Results

5.1 Task Completion Times

All participants demonstrated learning from the beginning of the test session to the end of the session, as evidenced by faster task completion times. As shown in Fig. 1, all 6 participants completed their tasks faster in Round 3 compared to Round 1. Five of the six participants completed the Round 2 tasks more quickly than the Round 1 tasks. The lone participant who had a slower time in Round 2 than in Round 1 spent a lot of time talking about the application while completing tasks during Round 2.

Fig. 1. Time (in seconds) to complete all tasks per round, per participant

All 6 participants completed the tasks more quickly in Round 3 compared to Round 2. The improvement in task completion times is evidence that the overall design paradigm of our enterprise application is learnable.

To identify the features that were not easily learned, and therefore might warrant redesign efforts, we calculated the average percent improvement for each task across participants. Because participants all work at their own pace, raw overall times provided little basis for determining whether improvement was happening consistently across the board; percent improvement allowed us to compare times across participants. We calculated the percent improvement in completion time for each task, per participant, across rounds, and then averaged across all participants to uncover any differences in the learnability of tasks. One way to express this calculation is shown below.
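With $T_{p,t}^{(1)}$ and $T_{p,t}^{(3)}$ denoting participant $p$’s completion time for task $t$ in Rounds 1 and 3, respectively (notation introduced here for illustration), the average percent improvement for a task over the $N = 6$ participants can be written as

$$\bar{I}_t = \frac{100}{N}\sum_{p=1}^{N}\frac{T_{p,t}^{(1)} - T_{p,t}^{(3)}}{T_{p,t}^{(1)}}.$$

A negative value of $\bar{I}_t$, as observed for Task 6, indicates that participants were on average slower in Round 3 than in Round 1.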

Participants improved their task completion times for all tasks except one. For that particular task, comprehension of two new concepts was needed. As shown in Fig. 2, completion time for Task 6 did not improve between rounds. In fact, participants were slower with this task during the last round of testing than they were during the training round.

Fig. 2. Average percent improvement in task completion time from Round 1 to Round 3

The average percent improvement data illustrate an underlying problem with two new concepts we were trying to introduce into the application. Task 6 required participants to understand our conceptual model and interact with new structures.

5.2 Moderator Redirects

The total number of moderator redirects was 16 for Round 2, and 7 for Round 3. Of the 16 redirects in Round 2, 11 involved the two new concepts, as did 4 of the 7 redirects in Round 3.

5.3 Confidence Ratings

The Confidence Scale ratings indicate that participants were mostly confident they had completed tasks successfully: 89 % of the tasks were rated a 5, 6, or 7 (where 7 = extremely confident). The task for which participants reported the lowest confidence was the one involving the two new concepts; most participants rated it a 4 or 5.

5.4 Open-Ended Questions

Among the six participants, there were a total of nine items listed in answer to the question: List a few things about this application that could be improved. Five of the nine items mentioned the design of our two new concepts.

5.5 The SUS

According to generally accepted standards [17], a product with a SUS score above 68 is considered usable. The SUS score for our software was 67.86. This score was higher than we expected, since the version we tested did not have complete functionality. However, a score of approximately 68 is not high enough for software that will be delivered to our end users. The good news was that, because of our other metrics, we knew where the problems were in our application.

6 Discussion

Based on task times across rounds of testing, confidence ratings, and SUS scores, our product is a usable and learnable enterprise software application, with some room for improvement.

Several subtasks proved challenging to participants in Round 1. For example, participants struggled to add text to a page: the moderator had to point out an icon that needed to be selected before participants could complete the task. In a typical usability session, this observation might have led to a redesign of the text feature, or a repositioning of the icon. However, adding text was performed quickly by all participants in Rounds 2 and 3, indicating that it did not need to be redesigned. If we had conducted only one round of testing, we would not have been able to distinguish the parts of the design that were genuinely confusing from those that were merely unfamiliar but quickly learned.

Participants were able to learn the navigation, menus, and object model of our application. However, they were unable to grasp the two new concepts we had hoped to introduce in our software. In addition to a lack of improvement in task completion time for these concepts, participants expressed frustration when interacting with them. They did not understand where they were in the application, where they were putting objects, or how to move objects between different sections of the user interface.

Participants were given explicit instructions about the concepts at the beginning of the session. They were taken through a series of tasks and given feedback and guidance. Despite multiple attempts to communicate and educate, participants could not understand the distinction between these two concepts. This critical finding guided our subsequent design efforts. Instead of merely debating whether our end users would grasp the general ideas behind these two new concepts, we had evidence that the concepts were vague and overlapping, and that their design needed to be revisited and revised. Our testing methodology provided a clear direction for the user experience design team. This finding would not have been clear with only one round of testing, particularly if we had followed a walk-up-and-use method. If participants in a single test session had struggled with our new concepts, one could contend that the concepts would be learned quickly enough, even if not immediately intuitive. Our testing procedure demonstrated that other aspects of the design were indeed learnable, but these concepts were not.

Enterprise applications are often used daily by business users, and learning takes place as those users interact with the application to perform their tasks. Identifying the design concepts that can be learned quickly versus those that cannot may have a substantial effect on the overall ease of use of the product. For complex software, evaluating the learnability of the design, in addition to its initial usability, can help uncover deeper usability issues.