1 Introduction

Assessing the aptitude and appropriateness of software systems relative to both purpose and requirements, along with evaluating their performance relative to user expectations, has long been a recurring theme in software engineering research. Investigations in these areas intend to contribute to improving overall system design and to support the initial development and further evolution of an artifact in use, so that systems better match purpose, requirements, and users’ expectations. More generally, such studies help to better understand the factors that lead to software engineering success. However, for reasons of high cost, heavy time commitments on the part of both developers and user-evaluators, and institutional barriers, among other hindering factors, systematic software artifact assessments and evaluations have been found difficult to conduct persistently [4].

Furthermore, numerous aspects have to be considered when assessing and evaluating software systems, ranging from investigations of internal architecture and code efficiency, through studies on the effectiveness of human–computer interaction, to user satisfaction and usability, among others, so that the purposes and foci of evaluative studies can vary widely. Hence, what constitutes ultimate software engineering success is still an open debate [15]. As shown in the next section, user satisfaction and effective-use studies have been conducted in increasing numbers in recent years; however, the criteria and frameworks used in such studies also vary widely, making it difficult to compare study results.

Interestingly, in times of burgeoning mobile and web-based applications (apps) competing for market share, evaluative user satisfaction and effective-use studies have rarely been used to compare such artifacts, even though such comparisons could greatly help ongoing software engineering efforts in these markets. A few years ago, the TEDS framework and procedure was introduced [21] and successfully utilized in a number of empirical user satisfaction and effective-use studies, which also encompassed detailed artifact comparisons [10, 19, 20, 22].

While TEDS has demonstrated its effectiveness and analytical power in these studies, leading to highly detailed and comprehensive results, it has also demonstrated its limitations with regard to the aforementioned constraints of high cost, heavy time commitment, and difficulties in recruiting user-raters/evaluators. In order to address and mitigate these three specific barriers, the researchers developed, introduced, and tested TEDSrate, a Web-based application (app) that allows recruiting and employing user-raters/evaluators anytime and anywhere. In this paper, TEDSrate, its uses, and the initial experiences with using it in evaluative studies are presented and discussed.

The paper is organized as follows: In the next section, related work is reviewed leading towards the research question. Then, the design of TEDSrate is presented followed by the description of real-world pilot tests of the application. The results of the pilot tests are discussed, followed by the presentation of future work building on this discussion. The paper then concludes that frameworks/procedures like TEDS and supporting applications such as TEDSrate can effectively help conduct systematic user satisfaction and effective-use studies.

2 Literature Review

As mentioned before, determining and measuring the ultimate success of a software engineering project and its resulting artifacts has long been a focus of debate. As early as the 1980s, fairly detailed categories had been specified for determining and assessing the relative value added by information systems with regard to the specific contexts of their use and the respective information environments in which they operate [25]. Later, the DeLone & McLean (D&M) model of information system success, in its various evolutionary versions [7, 8], has served as a reference on a high level of abstraction in a number of SE-related fields and subfields [15, 23]. The D&M model relates three high-level quality variables (information quality, system quality, and service quality) to equally high-level variables of system use (or the intent of its use) and user satisfaction, which in turn are said to lead to measurable or perceived net benefits; these net benefits feed back on system use and user satisfaction, the latter two of which are also connected via feedback [8]. Addressing these feedback relationships, another recent study pointed to the importance of project efficiency, artifact quality, market performance, impact on stakeholders, and time as influential dimensions of software engineering success [11, 15]. Software engineering success, along with overall information system or artifact success, apparently depends on interacting and interdependent variables [9], which make the respective outcomes dependent on factors not completely controllable by designers, developers, and project leaders.

As a result, multiple studies have focused on better understanding these context-related factors and feedback loops. For example, recent workshops and studies emphasized user involvement in design and testing [3, 4, 14, 24]. Others highlighted the importance of continuous feedback on artifact (use) performance [1, 17]. Yet others have relied on built-in monitoring and self-tuning functionalities as well as automatic user review scanning and salient-issue ranking methods [5, 6, 13]. Also, although not new, recent studies have reintroduced the utilization of personae and scenarios in both artifact design and artifact evaluation [2, 18].

However, the D&M model variables can hardly be studied in isolation, nor can they be effectively addressed at a high level of abstraction when it comes to design-relevant and artifact-specific recommendations (or comparisons). The TEDS framework and procedure [21], which represents a substantial extension of the aforementioned “Value-added Processes” work advanced in the 1980s [25], not only breaks down the six high-level variables of the D&M model into detail, but also accounts for the interaction between the variables within a given context by employing the concepts of personae and scenarios. The TEDS framework distinguishes six major categories: (a) ease of use/usability, (b) noise reduction, (c) quality, (d) adaptability, (e) performance, and (f) affection. These main categories are further broken down into 40 sub-categories that specify and detail them. The TEDS procedure then specifies thirteen steps for evaluating what is called an “information artifact,” a summary term used to represent any information technology or software artifact that a human actor may use for her or his purposes within a certain context. The term “information artifact” encompasses “both sources and pieces of information as well as information systems and other information technology artifacts” [20, p. 141]. The concept acknowledges that “information” is a context-dependent entity providing a certain meaning in the eyes of a beholder, and that this information and the technology carrying and containing it can no longer be sharply distinguished from each other.

As mentioned, the TEDS framework and procedure has demonstrated its analytical power in various empirical studies [19, 20, 22], in which it helped derive detailed recommendations for developers and designers and also provided valuable competitive information to service providers who intended to improve their online offerings. However, while the results quite strongly proved the effectiveness of the overall concept of information artifact evaluation by means of the TEDS framework and procedure, it was still subject matter experts who had to carry out the detailed assessments and evaluations in a rather time-consuming and costly fashion [4], and also in geographically limited areas, all of which would present serious constraints for the future use of TEDS.

3 Research Question and Methodology

As a natural next step, the authors considered building a web-based tool for using the TEDS framework and procedure, which would reliably facilitate artifact assessments and evaluations by both subject matter experts and laypersons alike on a broad and potentially global scale. It was reasoned that, with increasing sample sizes and controllably established demographics, this would enable information artifact evaluations that were inexpensive yet comprehensive at the same time. In the following, the requirements, design criteria, and design options for a web-based tool enabling the use of the TEDS framework and procedure are discussed. This addresses the research question:

RQ:

What kind of Web-based tool can help subject matter experts and laymen alike perform TEDS-based evaluations capably and with global access?

3.1 Design Considerations

Overall Requirements: When analyzing how TEDS was “manually” used in projects of empirical information artifact studies, that is, when the projects followed the 13-step procedure as described elsewhere [21] without the support of information and communication technology (ICT), the authors identified the functional and non-functional requirements of a to-be ICT-supported TEDS tool.

3.1.1 Functional Requirements

Rating/Evaluation Component: The TEDS tool had to be able to input, record, and display scale ratings (for example, on a 1–5 Likert scale) from human raters for up to six main categories and up to forty sub-categories of TEDS in a pre-specified number of scenarios and for a pre-specified number of personae. As part of the evaluation component, the TEDS tool further had to be able to calculate and present/print average scale ratings per category/sub-category for each persona and scenario, along with the standard deviation. Beyond recording numerical scale values, the TEDS tool had to be able to record free-format text comments along with screenshots of a rated artifact for each category and sub-category in any persona-scenario couplet. Recording the ratings, along with online raters’ detailed demographic information, needed to occur in an IRB-acceptable space that protects human subjects. The TEDS tool report component also had to be able to pivot results along each dimension and to include raters’ comments and screenshots in reports. Rater-provided screenshots and comments had to be searchable/findable per artifact, scenario, persona, and rater.
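
The required per-category aggregation can be sketched as follows. This is an illustrative computation only (it uses the population standard deviation and hypothetical record shapes), not the actual TEDSrate code:

```javascript
// Aggregate Likert ratings (1-5) into n, mean, and standard deviation
// for one persona-scenario couplet. Population std. dev. is used here;
// the actual tool may use the sample formula instead.
function aggregate(ratings) {
  const n = ratings.length;
  const mean = ratings.reduce((s, r) => s + r, 0) / n;
  const variance = ratings.reduce((s, r) => s + (r - mean) ** 2, 0) / n;
  return { n, mean, stdDev: Math.sqrt(variance) };
}

// Group raw rating records by category, then aggregate each group.
// The { category, value } record shape is an assumption for illustration.
function aggregateByCategory(records) {
  const groups = {};
  for (const { category, value } of records) {
    (groups[category] = groups[category] || []).push(value);
  }
  return Object.fromEntries(
    Object.entries(groups).map(([c, vals]) => [c, aggregate(vals)])
  );
}
```

The same grouping generalizes to persona and scenario dimensions by extending the grouping key.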

Administration/Configuration Component: In order to make the TEDS tool usable for multiple projects and studies, a configuration tool was required; also, for the analysis of results, an administration tool for projects and configurations was needed. The TEDS tool admin/configuration had to be able to freely configure categories and sub-categories (all, sub-sets, or extensions). It also had to be able to cluster and re-cluster sub-categories. The TEDS tool admin/configuration further had to be able to add, modify, and remove artifacts, scenarios, and personae; to modify the descriptions of categories, sub-categories, and topical clusters; and to add, modify, and delete collected rating data. For use with external tools, rating data and reports had to be exportable into CSV format. Export or handover to other utilities, such as the R project for statistical computing, had to be provided for post-processing of results.
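
The CSV export requirement might be met along the following lines; the record shape and column names are assumptions for illustration, not TEDSrate’s actual export format:

```javascript
// Flatten rating records into RFC-4180-style CSV for handover to
// external tools such as R. Values containing commas, quotes, or
// newlines are quoted, with embedded quotes doubled.
function toCSV(rows, columns) {
  const escape = (v) => {
    const s = String(v ?? '');
    return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
  };
  const header = columns.join(',');
  const body = rows.map((r) => columns.map((c) => escape(r[c])).join(','));
  return [header, ...body].join('\n');
}
```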

3.1.2 Non-functional Requirements

For reaching out to expert and layman raters without geographical and time constraints, the TEDS tool needed to be Web-based and work in any Web browser. The browser-based user interface had to be easy to navigate and operate. For easy and straightforward rating and recording, the TEDS tool had to be able to display the information artifact under evaluation alongside the rating tool in a browser window without interfering with the artifact’s functionality. Given the electronic mass recruitment of raters, for example, via Facebook advertisements, the rater population would be diverse, and so would be their devices and platforms. Consequently, the TEDS tool had to be able to support a wide range of devices. The user interface of the TEDS tool had to be adaptable and adjustable depending on the artifact under evaluation, for example, for mobile applications versus web pages, or for full-blown TEDS evaluations versus subset evaluations. Demographic questions had to be configurable relative to the respective TEDS study design. Ratings were to be recorded instantaneously. Rating sessions were to be temporarily suspendable and resumable at a later point in time without loss of data. Raters were to be informed about their progress towards completion of the rating exercise. Rating results were to be searchable instantaneously. High standard deviations in ratings, along with other outliers, were to be made visible. Graphics and charts were to support the analysis of rating results. Finally, recruiting and signing up raters, conducting ratings, and recording and storing large amounts of data were to be performed in a fashion allowing for comprehensive empirical studies with low or no budgets.

3.2 Design Criteria

When reviewing and considering the requirements, it quickly became clear that publicly available generic tools such as Google Forms or SurveyMonkey were not suitable solutions for meeting IRB requirements and human subject protection needs and/or would carry prohibitively high price tags when signing up raters. Also, because the respective data would be inaccessible, statistical analyses on raters’ demographics would have required significant overhead with those generic tools. Furthermore, some essential functionality, along with the needed flexible and robust configurability options, would not have been attainable with such publicly available tools. Consequently, the researchers decided to build a homegrown tool that would meet all requirements, including the storage of collected data on secure institutional servers. Moreover, it was reasoned that a homegrown tool would far better fit the flexibility and configurability needs of future TEDS-based empirical projects.

3.3 Design Options

When analyzing various (also alternative) tool design options, we ultimately settled on the LAMP (Linux, Apache, MySQL, and PHP) stack. In our reasoning, LAMP was not only popular, cost effective, and open source, but also provided the advantages of known runtime robustness along with generally high performance, global resource and support bases, excellent documentation, and sustainability for future development. Along these lines, the high potential for continued future talent recruitment from a vast pool of developers knowledgeable in this platform was another important argument in favor of LAMP.

Among the other options considered were Windows as the server platform, NoSQL as the database, .NET as an alternative to PHP, and native code development as opposed to Web-based application (app) development. In each individual area, as well as for the platform as a whole, we concluded that LAMP was favorable. Windows as a proprietary server platform appeared more costly in terms of available development resources, installation cost, and upgradeability/version sustainability. The enterprise-grade .NET framework seemed to be overkill relative to the foreseeable present and future research needs of the envisioned, relatively small system, which were seen as fully covered by PHP; the latter also provided rapid prototyping and app development along with boilerplate constructions of Web-based application program interfaces (APIs). Also, we did not expect much server-side logic to be needed. As a result, we saw PHP as a right-size/right-weight choice. On the client side, we could have opted for developing a native application instead of a Web-based application. However, this would have entailed a high load of proprietary custom development and maintenance along with portability issues, among others, whereas a Web-based client would be easier to develop, maintain, and distribute. Finally, relational characteristics are a mainstay of TEDS-based use and usability studies, so that a relational database concept was the natural choice over non-relational concepts. Among relational databases, MySQL had the advantages of cost effectiveness, slimness, platform independence, robustness, and non-proprietariness over other options such as Microsoft SQL Server or Oracle. In summary, the LAMP stack appeared to be the logical platform for the development and implementation of the Web-based TEDS rating tool, which was dubbed TEDSrate.

3.4 The TEDSrate Approach

According to the functional requirements, TEDSrate would need four main architectural components: (a) an administration and configuration component, (b) a rating or evaluation/assessment component, (c) a database component for storing study configurations as well as evaluation results and ratings along with qualitative data such as comments and screenshots, and (d) a result query and presentation component (see Fig. 1). A fifth architectural component, an automatic statistical post-processor, was and still is under consideration for a future version of TEDSrate. In its current implementation, TEDSrate uses both plain PHP scripts and the object-oriented CodeIgniter (CI) PHP framework (Fig. 2).

Fig. 1.
figure 1

TEDSrate overview

Fig. 2.
figure 2

TEDSrate admin/configuration tool (project: sports mobile app comparison)

The Admin/Configuration Component allows administrators to create and manage TEDS research projects. On the server side, a new project is started in the admin function by defining and attaching the project’s use facets such as artifacts, personae, scenarios, and roles. Several scripts handle project setup and management, including adminproc.php (for admin login/logout), start.php (for handling the routing logic for new assessments and new users), assessment.php (a misnomer for legacy reasons, now containing an Angular template for issuing assessments), upload.php (for uploading rater screenshots and providing feedback to the raters), and welcome.php (for helping raters navigate configurations). In recent rewrites and updates to TEDSrate, CodeIgniter has been used as an efficient replacement for the previously used plain PHP models to interact with the data layer, since it also allows for the creation of a REST (representational state transfer) API, which is now the primary means of interacting with the database, facilitating CRUD (create, read, update, and delete) operations on all entities of the data schema.
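
The CRUD semantics that the REST API exposes on the schema’s entities can be illustrated with a minimal in-memory store. This sketch is purely hypothetical and stands in for the actual CodeIgniter/MySQL implementation; the entity name used in the usage note is taken from the data schema:

```javascript
// In-memory stand-in for the REST API's CRUD operations. Each entity
// type (artifact, scenario, persona, ...) maps to a table of rows
// keyed by an auto-incremented id.
class EntityStore {
  constructor() {
    this.tables = {};
    this.nextId = 1;
  }
  create(entity, data) {               // POST /entity
    const id = this.nextId++;
    (this.tables[entity] = this.tables[entity] || {})[id] = { id, ...data };
    return id;
  }
  read(entity, id) {                   // GET /entity/:id
    return (this.tables[entity] || {})[id] || null;
  }
  update(entity, id, data) {           // PUT /entity/:id
    const row = this.read(entity, id);
    if (row) Object.assign(row, data);
    return row;
  }
  remove(entity, id) {                 // DELETE /entity/:id
    const row = this.read(entity, id);
    if (row) delete this.tables[entity][id];
    return !!row;
  }
}
```

For example, `store.create('persona', { name: 'Casual Fan' })` mirrors a POST creating a persona row.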

Furthermore, the Internal API handles specific processes such as receiving project overviews and generating report tables. On the client side, Admin.js is an Angular script that supports the creation of project entities such as artifacts, scenarios, personae, roles, user interface configurations, and evaluations. Admin.js allows administrators to view rating results in the form of pivot tables presenting means and standard deviations. It further provides access to and graphically presents raters’ demographic information. Moreover, Admin.js presents statistics along three dimensions: artifacts across a scenario, scenarios for one artifact, and an evaluation across a user interface configuration. The former two statistics provide aggregate data for the respective variables; the latter allows the granular inspection of individual evaluations when checking for data consistency and quality.

The Evaluation/Assessment Component. Much of the evaluation and assessment component resides on the client side, which has also mostly moved from legacy plain JavaScript components to the Angular application module Assessment.js, which implements the logic for rater evaluations. This module is used for evaluations by both expert raters and layman raters and contains functionalities such as auto-saving, progress tracking, re-routing in case of evaluation/evaluator-rater mismatch, and screenshot uploading with progress feedback. The module also accounts for the various user interface configurations on the client side.
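
The progress-tracking functionality can be reduced to a small pure function; the item shape shown here is an assumption for illustration, not the actual Assessment.js data model:

```javascript
// Completion is the share of required rating items that already have
// a saved value. Optional items do not count towards the total.
function progress(items) {
  const required = items.filter((i) => i.required);
  const done = required.filter((i) => i.value != null);
  return required.length === 0 ? 1 : done.length / required.length;
}
```

The returned fraction would then drive the progress bar shown to the rater.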

The Database Component. The relational database (see Fig. 10 in the Appendix) contains tables for projects, artifacts, scenarios, personae, roles, and configurations. The latter serves as a container for four configuration types: attributes, assessment, questions, and user interfaces (UIs). It also provides an obscured ID in the form of a hash, which allows raters to be added via the start.php script. Via the attribute configuration, TEDS evaluation subsets can be configured (for example, instead of all forty sub-categories, only groups or clusters of categories/sub-categories can be selected for evaluation). The assessment configuration table specifies the key variables of the study, that is, the artifacts (usually a website or mobile app) and the scenarios, personae, and roles. The question configuration table serves as a target to associate the project with a group of survey questions. The UI configuration table contains the specification of the rating style (for example, a Likert scale). The assessment table is the reference point for ratings, comments, and screenshots; it also holds time stamp information. The attribute table specifies the TEDS category/sub-category or configured cluster; it further holds the attribute description or explanation in academic or layman language. The rating table stores the rating value for a single attribute and also serves as the reference to attach attribute-related textual rater comments and screenshots. The question table holds the information on demographic questions (question title/name, description, and requirement status), whereas the response table stores the respective rater responses. Finally, the user table stores personal identifiers such as email address, first name, last name, and password along with the respective user’s authorization level.

The schema also contains a number of associative entities such as project (parent), artifact, scenario, persona, role (children) or question (parent), project, artifact, scenario, persona, role, attribute (children).

Stored Procedures and Worked Scenarios. TEDSrate also contains about thirty stored procedures such as addPersona, addPersonaScenario, addProject, addProjectArtifact, addRating, addResponse, addScenario, addScreenshot, addUser, getAllArtifacts, getAllPersonae, getAllProjects, getCategories, getCriteria, getProject, getUser, updateCategory, and updateUser, among others.

Further, worked scenarios include starting a project, creating a configuration, and running a report.

4 Pilot Tests with Real-World Organizations

Two TEDSrate-based evaluations of different artifacts were carried out concurrently, one in the environment of professional disaster response management at the City of Seattle’s Emergency Operations Center (EOC), and the other with a major league soccer club (Seattle Sounders FC). In the case of the Seattle EOC, a Web-based artifact was evaluated, which responders mainly work with on desktop computers during the response to an emergency or a disaster. In the other case, a mobile application was rated, which ticket holders, fans, and supporters of the Sounders FC franchise use to keep up to date about their team and to shop for franchise-related merchandise or tickets.

4.1 Government-Internal Website Evaluation (WebEOC)

Intermedix’ WebEOC® is a Web-based application suite, which is tailored to help Emergency Operations Centers (EOCs) manage the response to and early recovery from disasters. The suite is configurable and expandable and enjoys a relatively large user base among EOCs in the United States. In recent years WebEOC has been criticized for its cumbersomeness, complexity, and old-fashioned user interface.

The City of Seattle’s EOC had a vested interest in identifying the exact problem areas of WebEOC from a user’s perspective, that is, from a disaster responder’s view. TEDSrate was configured and used to receive ratings and feedback from responders who had recently used WebEOC during a disaster response or exercise.

In particular, four scenarios of utilization, each of which comprises one or more use cases, were seen as potentially in need of improvement along several lines (UI, performance, logic, etc.). The four utilization scenarios were (1) Signing into WebEOC, (2) Lookup EOC Personnel on duty, (3) Document Your Section’s Staffing, and (4) Gain General Situational Awareness. The evaluation was carried out before and immediately after a major exercise was conducted involving over 200 responders in June 2016. The purpose of the evaluation was explained to responders on the entry screen (see Fig. 3).

Fig. 3.
figure 3

TEDSrate configurable entry screen

It is noteworthy that, except for the introductory information on the entry page, no further training on the tool or method was required for responders to perform the requested evaluations for the four scenarios. The evaluation would be taken on a split screen, that is, with the WebEOC artifact alongside the TEDSrate window.

4.2 App Evaluation (Sounders FC’s Mobile iOS App)

Almost every franchise in Major League Soccer (MLS) has implemented a mobile application for smartphones or tablets. While the websites of all franchises are designed, operated, and maintained by the League, the franchises have greater leeway to develop and implement their own mobile apps. The various MLS team websites are distinct in appearance (logos, team colors, etc.) and content (team-related information); however, they are uniform in terms of functionality and style guidelines. When it comes to mobile apps, the League appears to mandate only adherence to guidelines of presentation style and merchandising, whereas the functionality of apps may differ widely between franchises.

Since its introduction to the League in 2009, Seattle Sounders FC has developed into a commercially highly successful MLS franchise with by far the highest average attendance in the League (44,247 in 2015), which is more than double the League’s average (21,574 in 2015) and even exceeds the average attendance of the best-attended league worldwide, the German Bundesliga (43,177 in 2015) [12, 16].

A comprehensive TEDSrate-based evaluation of an early version of the second generation of the Sounders FC mobile iOS app was conducted at a time when the app development process had not yet concluded and was still open to extensions and modifications based on the evaluation results. The evaluation was performed in two rounds, first with expert raters who had been involved in a larger study comparing the mobile apps of a total of eleven leading professional soccer teams worldwide. The results of this separate study have been published elsewhere. These expert raters also evaluated the early second-generation mobile app of Sounders FC following the 13-step TEDS procedure in the traditional fashion without the support of TEDSrate. By mid-2015, the Sounders FC franchise agreed to collaborate with the research team on organizing a TEDSrate-based evaluation of the second-generation mobile iOS app, with the aim of incorporating the results of both the experts’ ratings and the TEDSrate-based ratings in the further development of the app. Via targeted advertisements on Facebook, “layman” raters were recruited who would then be directed to the TEDSrate evaluation site and asked to rate the second-generation Sounders FC mobile iOS app. As in the case of the WebEOC evaluation, the “layman” raters would not receive any particular introduction or training other than what was interactively available from the TEDSrate website. As intended, the Facebook recruitment of “layman” raters provided a wide spread of geographical, age, gender, and other backgrounds in the sample.

4.3 Demographics Module

When moving from purposively selected expert raters to a wider population of non-expert (“layman”) raters, it was imperative to collect demographic data in order to better quantify and qualify the results. More detailed and more specific demographic data would be needed for larger populations (for example, “Asian soccer fans,” “North American soccer fans,” or “European soccer fans”; see Fig. 5) than for smaller and more homogeneous populations such as “City of Seattle Emergency Responders” when making sense of and relating the rating results to demographic characteristics in the analysis phase.

As mentioned before, demographic questions are configurable accounting for larger and diverse populations.

4.4 The Rating Procedure

TEDSrate allows for configuring and adjusting the categories and sub-categories of the TEDS framework. As mentioned before, the framework consists of six main categories and forty sub-categories, which can be expanded or consolidated depending on the desired granularity of the specific evaluation project. In the case of “layman” evaluations, fewer, consolidated categories/sub-categories serve the evaluation purpose more effectively than overly specific and detailed rating schemes, which typically only experts fully understand and can rate in an informed fashion. In the context of TEDS, we refer to “experts” as individuals who have attended a TEDS framework and procedure training and, after completing an artifact rating, have also attended an inter-rater validity and consistency checking session (Fig. 4).

Fig. 4.
figure 4

Sample demographic questions (configurable)

In the case of the WebEOC website evaluation as well as in the case of the “layman” evaluation of the second-generation Sounders FC mobile app a consolidated framework was used, which was reduced to twelve sub-categories (two for each main category—see sample screen in Fig. 5), whereas the expert evaluation of the mobile app used the entire framework of forty sub-categories.

Fig. 5.
figure 5

Sample rating screen with Likert scale, free-format text comments, and screenshots (configurable)

Transparently to the individual rater who uses the rating tool, TEDSrate saves all data entries immediately via AJAX calls to the server. Each entry, whether it is a Likert scale radio button tick, a text comment, or an artifact screenshot, is saved individually, so that client-to-server communications are relatively small and therefore fast.
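
Why these per-entry saves stay small can be illustrated with a payload builder; the field names and overall wire format here are assumptions for illustration, not TEDSrate’s actual protocol:

```javascript
// Build one small JSON payload per saved entry (a single tick, comment,
// or screenshot reference), rather than posting the whole form at once.
function buildSavePayload(assessmentId, attributeId, kind, value) {
  return JSON.stringify({
    assessmentId,
    attributeId,
    kind,                             // 'rating' | 'comment' | 'screenshot'
    value,
    savedAt: new Date().toISOString() // server may instead stamp on arrival
  });
}
```

Each payload would then be sent in its own AJAX call as the rater interacts with the form.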

Whatever configuration is used, the rater sees her advancement towards completion of the evaluation by means of a progress bar displayed at the bottom of the rating screen.

If raters have to postpone the completion of the evaluation for some reason, they find the latest data they entered pre-filled in the form, so that they can continue the rating at exactly the point where they left off.

Most artifacts are designed to serve multiple purposes and subsequently are used in practice in more than one scenario of utilization. However, the evaluation with TEDSrate has to distinguish between scenarios, since an artifact might be highly rated for some uses and certain scenarios, while it may fall short in others.

As an example, for the mobile apps of soccer clubs such as Sounders FC, Real Madrid, or FC Barcelona, the scenarios of “player information” and “schedule and results” might be evaluated, among others. Hence, a rater has to go through the rating procedure as many times as separate scenarios were configured for evaluation. Whenever raters complete the rating of a scenario or leave a session unfinished, upon exiting they are reminded of the overall completion status of their assignments, including any scenarios that still need ratings (see Fig. 6).

Fig. 6.
figure 6

Survey completion update

In evaluation assignments with several pre-configured scenarios or attributes, TEDSrate also allows for randomizing the order in which the various scenarios or attributes are presented to the rater.
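
Such order randomization is commonly implemented with a Fisher-Yates shuffle; the following is a generic sketch, not necessarily the randomization code TEDSrate itself uses:

```javascript
// Return a uniformly shuffled copy of the scenario/attribute list,
// leaving the configured original order untouched.
function shuffled(items) {
  const a = items.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1)); // pick from 0..i
    [a[i], a[j]] = [a[j], a[i]];                   // swap into place
  }
  return a;
}
```

Presenting each rater a fresh `shuffled(scenarios)` order helps average out position effects across the sample.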

4.5 Mitigating Rater Fatigue

In the course of both artifact evaluations, the WebEOC website and the Sounders FC mobile app, rater fatigue was observed. Some “layman” raters would leave rating sessions incomplete even after repeated reminders. While randomizing the order of assignments appeared to have some mitigating influence on rater fatigue, other means such as incentives were considered and became part of the TEDSrate tool during the practice test phase. In particular, when populations with potentially short attention spans are targeted, the incentive module can be configured. It was implemented in the format of a lottery, in which raters who completed their assignments earned “tickets” of certain material value, which could then be used for purchases or other benefits. In the case of the Sounders FC mobile app, the lottery-based mitigation strategy worked satisfactorily, leading to much increased completion rates. The researchers also successfully experimented with giving out $5 gift certificates to the first 25 raters who completed the TEDS surveys for two scenarios, identified by timestamp and user ID information. Likewise, this led to more and faster completion of surveys in this particular pilot.
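Identifying the first 25 completers by timestamp and user ID, as in the gift-certificate experiment, reduces to sorting completion records by time. A sketch with an assumed record shape:

```javascript
// Sketch of the gift-certificate variant of the incentive module: pick
// the first n raters who completed their assigned surveys, ordered by
// completion timestamp. The record shape is an assumption.
function firstCompleters(completions, n) {
  // completions: [{ userId, completedAt }], one record per rater who
  // finished all assigned scenarios; earlier timestamps win
  return completions
    .slice()                                    // keep input intact
    .sort((a, b) => a.completedAt - b.completedAt)
    .slice(0, n)
    .map(c => c.userId);
}
```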

4.6 Presentation and Analysis of Results

In both pilot tests, the feature of the TEDSrate Admin utility that lets researchers track evaluations and see preliminary results in real time while the evaluations are still underway was found highly informative and beneficial. All analytical functions can be performed this way or after the evaluation project has ended, for example, inspecting pivot tables of ratings along the lines of configurations, scenarios, or artifacts. The utility also allows for the selection and instantaneous analysis of demographic sub-samples, comment presentation, screenshot inspection, and data export to external analysis tools (for an example, see Fig. 7).

Fig. 7.

Likert ratings for two scenarios along twelve sub-categories for the Sounders FC mobile app
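The kind of pivot table shown in Fig. 7 can be computed at any point while ratings are still coming in. A minimal sketch of averaging Likert scores grouped by scenario and sub-category (field names are illustrative assumptions):

```javascript
// Sketch of the pivot-table analysis: mean Likert ratings grouped by
// scenario and sub-category, recomputable in real time as entries arrive.
function pivotMeans(ratings) {
  // ratings: [{ scenario, category, score }]
  const sums = {};
  for (const r of ratings) {
    const key = `${r.scenario}|${r.category}`;
    sums[key] = sums[key] || { total: 0, n: 0 };
    sums[key].total += r.score;
    sums[key].n += 1;
  }
  const table = {};
  for (const [key, { total, n }] of Object.entries(sums)) {
    table[key] = total / n;  // mean rating per scenario/category cell
  }
  return table;
}
```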

The visualization and formatting of results was found essential for analytical interpretation, also due to the sheer amount of detailed data produced. Not only numerical data but also comments, screenshots, and demographic information were targets of visualization and formatted display, which helped focus the analytical treatments and speed up the overall analysis process (for an example, see Fig. 8). In ongoing rating campaigns, the immediacy of information availability, in particular with regard to demographic information, helped target the rater recruiting so that the various identified personae could be exactly represented and matched by the sample of raters. Formatted displays for comments and screenshots supported the straightforward inspection of data and their analytical interpretation. When numerical data showed relative strengths or weaknesses in a particular area, for example, “navigation and findability” in the “player information” scenario, then the comments and screenshots that raters had provided in that particular area could be inspected (see Figs. 7 and 9).

Fig. 8.

Visualization of usage frequencies in support of interpreting the weight and validity of ratings

Fig. 9.

Inspecting raters’ comments and screenshots in a target area based on clean and formatted displays

5 Discussion

As shown in the section on related work above, software engineering success depends on a number of interacting and interdependent variables. Some of these escape the developers’ span of control, whereas others, which can be directly influenced, have so far gone largely unattended due to the prohibitive cost and the overwhelming commitment of resources and time needed to uncover deficiencies in, for example, artifact quality, attractiveness, user satisfaction, and system use.

Feedback, if any, that could practically and effectively influence how developers and designers tweak or reshape an artifact to better meet expectations and needs would be slow in coming and probably incomplete. While the TEDS framework and procedure might be the most comprehensive and systematic analytical lens available for assessing, evaluating, and comparing artifacts, it also suffered from the high cost, long time to conclude, and heavy resource commitment necessary to arrive at detailed, conclusive, and robust results. In many instances, even if such a level of effort had been expended, it would not have produced the needed feedback in due time; for example, a market opportunity might have already vanished, or worse, damage might have already been inflicted. The critical question then became how the prohibitively high cost, long turnarounds, and excessive resource commitments for systematic artifact evaluations could be cut down without compromising the validity and robustness of results. This led the research group to consider, specify, design, develop, and test TEDSrate in practice.

The tool underwent two real-world tests, one with the City of Seattle Emergency Operations Center (EOC) for a desktop-operated web-based application suite (WebEOC), which serves as the Center’s linchpin in disaster response. The other real-world test was simultaneously conducted with Seattle Sounders FC for the soccer franchise’s mobile application, which is the centerpiece of interaction between the club and its supporters and match attendees.

These two tests clearly demonstrated the effectiveness and utility of the tool, which produced robust and reliable results that both organizations used to make targeted changes to the configuration of their respective artifacts. In the case of Sounders FC, the test identified in fine detail the areas that needed improvement. Moreover, informed by rater comments and screenshots and through pinpointed comparisons with other “best-in-class” implementations, detailed design recommendations were given to the mobile app developers, many of which have meanwhile been implemented in version 2 of the Seattle Sounders FC mobile app.

The two tests were conducted over a period of six weeks. A total of 90 raters were involved, most of whom completed all web-based TEDSrate surveys in all scenarios to which they were assigned. The recruiting of “layman” raters was found easier when certain material incentives, for example, gift cards, were offered. Recruiting raters for the Seattle Sounders FC app via the Sounders’ Facebook site by means of targeted Facebook advertisements was straightforward. In the case of WebEOC, the raters were recruited via EOC-internal email invitation. However, in other artifact evaluation and comparison studies, different recruiting approaches may also be effective.

Since TEDSrate is web-based, the reach of this artifact evaluation and comparison tool is global, so that virtually any target audience can be reached directly. Results of TEDSrate-based artifact evaluations and comparisons become available instantaneously, which is also a great benefit to developers if TEDSrate is used in pilot testing and iterative development cycles. The tests demonstrated that little time, low cost, and few resources were needed to produce detailed artifact evaluations and real-world feedback.

These results give us confidence in asserting that TEDSrate has successfully addressed a core issue when it comes to enabling timely and effective artifact evaluation.

6 Conclusion and Future Work

Software engineering success hinges on a number of variables, not all of which developers and software engineers are able to influence directly. However, many of those that can be directly addressed have also gone unattended for reasons of the high cost, long time to complete, and prohibitive resource commitments necessary for producing meaningful and detailed feedback on artifacts. With the introduction of TEDSrate, a tool has been created and tested that overcomes these cost, time, and resource barriers. It helps collect, analyze, and present detailed feedback data, which can immediately be used to adjust designs and improve artifacts.

In the next version of TEDSrate, we will implement a post-processor that transfers the numerical data to statistics packages for appropriate automatic analyses. We are also considering transferring the comments to an automatic text-mining post-processor.