1 Introduction

Autism Spectrum Disorder (ASD) is a lifelong neurodevelopmental disorder that is routinely screened for in children as young as 18 months using gold-standard clinical instruments such as the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2) [1]. Early detection followed by targeted intervention has been shown to yield meaningful improvements in outcomes for individuals with ASD [2,3,4]. Despite the potential of early intervention to curtail developmental delays, constrained clinical resources and barriers to access prevent many families from obtaining these services [5, 6]. For example, evidence from the CDC indicates disproportionate identification of ASD along racial and ethnic lines: non-Hispanic white children are more likely to be identified than both non-Hispanic black children (by 30%) and Hispanic children (by 50%) [7]. Even when diagnostic services are available, constrained clinical resources can lead to substantial delays in diagnosis, resulting in lost opportunities for early intervention [8].

Attempts to address this issue of accessibility have produced a rich body of research on approaches to clinical screening as well as commercial products for early detection. A number of screening and diagnostic instruments have been developed, including the Modified Checklist for Autism in Toddlers (M-CHAT; [5]), the Autism Diagnostic Interview-Revised (ADI-R; [9]), and the previously mentioned ADOS-2 [1]. Recent commercial products have also been developed with aims of broadening access to early screening (e.g., Cognoa; [10]) and translating paper-based instruments to an easier-to-use digital platform (e.g., CHADIS; [11]). While each of these solutions addresses a number of the barriers to early screening and diagnosis, existing methods do not fully exploit the analytical and technological tools available for conducting brief, simple, and accurate screening by expert and non-expert administrators alike.

In response to the limitations of existing approaches, we have developed a tablet-based ASD screening tool called Autoscreen that uses machine learning (ML) methods and a data-driven design to effectively triage toddlers with ASD concerns based on an engaging and non-technical administration procedure. In this paper, we present the design of the novel system as well as preliminary evaluations of usability and acceptability by expert clinicians. We hypothesized that (a) ML algorithms could be developed to stratify risk according to binary labels with acceptable accuracy and (b) the novel system would be judged favorably on measures of usability and acceptability. The remainder of this paper is structured in the following manner: Sect. 2 discusses related work, Sect. 3 details the design of the software including the user interface and analytics modules, Sect. 4 presents the design of a preliminary evaluation of the novel application with expert clinical users, Sect. 5 gives preliminary evaluation data, and Sect. 6 provides a discussion of the results and concludes with a summary of the current work’s contributions, limitations, and planned future work.

2 Related Work

Some of the literature has highlighted the effects of age on completion of an autism screening test, such as the M-CHAT [5]; higher failure rates for M-CHAT tasks are more likely to be due to developmental immaturity than to the presence of ASD symptoms [11]. Another study found that a two-tiered screening process, using both the M-CHAT and the Screening Tool for Autism in Toddlers and Young Children (STAT), improved early identification of young children at risk for ASD [12]. ML techniques have also been used both to broaden the reach of ASD evaluation among at-risk populations and to improve upon widely-used ASD screening and diagnostic tools [13,14,15]. Specifically, ML algorithms have been instrumental in identifying small subsets of attributes (or “features”) from the items of ASD screening tools, such as the ADOS-2, that support diagnosis with a high degree of accuracy [13, 14]. However, when applying ML algorithms to autism diagnostics, it is necessary to avoid certain pitfalls, such as assuming the ready availability of numerous features from gold-standard instruments [14]. Avoiding this pitfall was an important consideration in the current work’s use of ML to identify a minimal feature set.

In addition to sophisticated paper-based instruments, digital applications have been developed to assess ASD risk. The previously mentioned Child Health and Development Interactive System (CHADIS) is a web-based platform for administering commonly used paper-based screeners and assessments in a digital format to a variety of populations, such as children with ASD and adolescents with eating disorders [11]. Although CHADIS supports an extensive catalog of ASD-centric instruments, it does not share the current work’s specific focus on ASD risk assessment in toddlers based on a very brief (i.e., 10–15 min) interaction requiring minimal training. Another related application, Cognoa, is a mobile application that provides a variety of screening and assessment utilities for individuals with concerns related to ASD and ADHD, among others [10]. While the technical elements of Cognoa are novel (e.g., child behavior analysis through uploaded video), like CHADIS, it may not facilitate rapid administration in constrained clinical settings. Furthermore, the availability of these tools does not guarantee uptake in clinical practices. Thus, there remains a need for a tool capable of providing ASD risk assessment in a manner that is brief, easy to learn and use, and accurate in its labelling of risk categories.

3 Autoscreen Software Design

Autoscreen, the system presented in this paper, is a tablet-based mobile application for ASD screening that uses ML algorithms to effectively separate toddlers into two categories of risk for ASD: namely, high and low. This system is designed to guide non-expert test administrators—who may be clinicians, primary care providers, or parents—through a series of play-like procedures intended to reveal characteristics of toddlers that are indicative of ASD risk. Two distinct software modules—a simple user interface and an analytics module for data processing—were designed to achieve this goal and are described in detail in Subsects. 3.1 and 3.2, respectively. The Autoscreen system was developed initially for Android devices (version 4.1 “Jelly Bean” and higher) using Unity 2017.3.0f3, a cross-platform 3D game engine, and the C# programming language, but Unity also supports seamless deployment to other environments including desktop operating systems.

3.1 User Interface Design

The Autoscreen system is intended to be an easy-to-use, accessible, and understandable screening tool for non-therapists or other untrained users. In order to create such a system, a wide range of design features needed to be considered, primarily application complexity, navigation, display arrangement, and intuitive interaction. The entire Autoscreen application can be logically divided into five subsections: landing page, tutorial, screening activities, scoring form, and risk assessment. The “screening activities” subsection can be further expanded into subject information entry, materials checklist, and tasks administration. Using the application, from start to finish, is largely a linear process.

Autoscreen is intended to run primarily on mobile devices, and each of the application pages must adhere to design choices that facilitate such use. Although pages in Autoscreen may have different purposes, according to their respective subsection, each builds on the same template (see Fig. 1). Each page displays informative features such as a descriptive header, buttons for playing audio tips, buttons for page navigation, and a navigation pane. The header and the audio tips buttons serve both to inform the user of the purpose of each page of the application and to offer an aural walkthrough of each task presented. The navigation buttons allow the user to move between neighboring pages while the navigation pane provides the user with a convenient reference to the present stage of task completion. The navigation pane is presented vertically on the left-hand side of the screen; each of the subsections is represented by a small representative icon. For example, a checkbox was used to represent the material checklist page, and a cinema camera was used to represent the tutorial page, which features descriptive videos.

Fig. 1.

A series of screenshots from the Autoscreen prototype: (A) the application landing page; (B) a page providing a list of video materials for training; (C) a page showing the instructions for a “Free Play” activity designed to elicit social behaviors over a period of two minutes; and (D) a page showing a timer associated with the “Free Play” activity.

While the aforementioned design layout is highly consistent throughout the application, each of the specific subsections makes use of different features. The tutorial page features video guides on application use, where touching a specific play button icon presents the user with detailed video examples of various application functionalities. The screening activities subsection is the most intensive portion of the Autoscreen application. In the first step of screening activities, subject information is entered by the user. In an effort to make data entry on a mobile device as convenient as possible, conventional widgets such as dropdown menus, radio buttons, and text fields were used. For example, the subject’s date of birth is fully selectable through dropdown menus and a unique subject ID can be specified via a text field. The next stage of the screening activities process involves the material checklist. The user is provided with a list of materials that must be acquired before continuing with the tasks; relevant items include a ball, matchbox cars, dolls, and snacks. The final segment of the screening activities subsection is the task page. Tasks are defined as either one-step or two-step. One-step tasks are untimed and display the instructions and materials involved in the task. Two-step tasks likewise consist of instructions and materials, but are also timed, and, as such, contain both an information page and a timer page (see Fig. 1C and D). The timer page prominently displays the time remaining across the middle of the screen. The two final subsections are the scoring form page and the risk assessment page. The scoring form page provides radio buttons for evaluating a child’s performance on the previous tasks on a 3-point scale. The attributes on which the child is scored were derived from the analyses discussed in the next section. Finally, the risk assessment page provides the user with a summary of risk classification based on the scores entered by the user: a quantitative risk score, a classified risk status (i.e., high or low), a listing of specific qualities identified as concerns, and another identifying strengths.

3.2 Analytics Module

Autoscreen embeds machine learning algorithms that were developed outside of the Unity environment. Using the ML toolkit scikit-learn [16], a variety of binary classification models were trained and evaluated using a 70–30 train-test split approach. The weighted F1 score was used as the primary metric for the evaluation of model performance because, unlike accuracy, it is robust to issues arising from imbalanced data. The data used for model training were obtained from a database of clinically-verified diagnoses of toddlers (aged 18–30 months) collected at the Vanderbilt Kennedy Center. This dataset was ideally suited for supervised ML due to its high-quality feature set, labelled binary structure (i.e., “ASD” and “Non ASD”), relatively large size (N = 737 examples), and relatively balanced makeup (i.e., 69.74% ASD versus 30.26% Non ASD). The features of this dataset included codes from a variety of clinical instruments including the ADOS-2 (Toddler Module), Mullen Scales of Early Learning (MSEL), and Vineland Adaptive Behavior Scales, Second Edition (Vineland-II). As in related work [14], we chose to explore model development focusing on features derived from ADOS-2; exploration of MSEL and Vineland-II features will be pursued in future work. Analysis of ADOS-2 codes contributed to the identification of the key dimensions of child behaviors that could potentially be teased out using short and engaging procedures designed in our novel screener.
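The model-evaluation procedure described above can be sketched as follows. Since the clinical dataset is not publicly available, the snippet substitutes synthetic data matching the reported size (N = 737) and approximate class balance (roughly 70% ASD); all function choices and parameters here are illustrative stand-ins, not the authors' actual pipeline.

```python
# Sketch of the train/evaluate loop under the stated assumptions:
# synthetic binary data in place of the clinical dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

# Synthetic stand-in mimicking the reported size and ~70/30 class makeup.
X, y = make_classification(n_samples=737, n_features=7,
                           weights=[0.30, 0.70], random_state=0)

# 70-30 train-test split, as described in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          stratify=y, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)
pred = model.predict(X_te)

# Weighted F1 is used rather than accuracy because it accounts for
# class imbalance in the test set.
print(round(f1_score(y_te, pred, average="weighted"), 2))
```

The same split and metric can be reused to compare alternative classifiers (e.g., k-Nearest Neighbors or Logistic Regression) on equal footing.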

It is important to draw a clear distinction between the features used in the ML analyses and the features ultimately used by Autoscreen in the prediction of ASD risk. The features in the ML analyses were originally derived from long, formal diagnostic interviews conducted by expert clinical administrators and it is not possible to obtain precisely equivalent features during a brief assessment. The input parameters obtained within Autoscreen, on the other hand, were designed to quantify a minimal set of characteristics indicative of ASD risk through a brief but efficient interaction. This minimal set of characteristics was selected through a process of feature selection conducted on the ADOS-2 codes using a variety of ranking methods which included information gain and the χ2 measure of fitness. This process revealed a common subset of seven core attributes—two related to communication, two related to reciprocal social interaction, and three related to restricted and repetitive behaviors—which were used to devise the screening activities and a new set of coded attributes in Autoscreen. Using the previously described 70–30 split design, a variety of simple supervised classification models were trained. The Naïve Bayes model using default parameters yielded the strongest performance with an F1 score of 0.94 (accuracy = 91.4%). The k-Nearest Neighbors (k = 5, uniformly weighted) model demonstrated comparable performance with an F1 score of 0.93 (accuracy = 91.0%), as did the Logistic Regression model using default parameters with an F1 score of 0.93 (accuracy = 90.0%). From these results, it was clear that even a simple model comprising only a handful of features could produce a strong ASD risk classification algorithm for practical screening. However, future work will be required to properly determine the accuracy of Autoscreen’s risk assessment algorithm.
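As a rough illustration of the feature-selection step, the sketch below ranks candidate code items by the χ2 measure of fitness and by mutual information (an estimate of information gain), then intersects the top-ranked items. The data are synthetic stand-ins for ADOS-2-style codes, and the specific functions and cutoffs are our assumptions rather than the authors' exact procedure.

```python
# Illustrative feature ranking on synthetic ADOS-2-style codes (values 0-3).
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(737, 20))   # 20 hypothetical candidate items
y = rng.integers(0, 2, size=737)         # binary ASD / Non-ASD labels

# Rank items by two criteria, as in the paper's ranking-based selection.
chi2_scores, _ = chi2(X, y)                                   # χ2 fitness
mi_scores = mutual_info_classif(X, y, discrete_features=True,
                                random_state=0)               # info gain

# Retain items ranked highly by BOTH methods (here: top 7 of each).
top_chi2 = set(np.argsort(chi2_scores)[-7:])
top_mi = set(np.argsort(mi_scores)[-7:])
common = sorted(top_chi2 & top_mi)
print("candidate core attributes:", common)
```

With real clinical codes, the intersection of such rankings is one plausible way to arrive at a small, stable subset like the seven core attributes reported above.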

4 Preliminary Evaluation of Usability and Acceptability

While accuracy of risk assessment will be critical to the success of Autoscreen, other major aspects of the application must also be carefully considered in order to maximize adoption likelihood. Two other major considerations described here are usability and acceptability. While these terms are somewhat subjective and task-specific, many researchers have found value in quantifying users’ perspectives of technologies along these dimensions [17,18,19]. In the context of this work, usability is intended to quantify users’ impressions of the general likability of the application with respect to user interface design, report content, and task presentation. The measure of acceptability, on the other hand, is intended to quantify users’ attitudes concerning the appropriateness, likely effectiveness, and feasibility of the proposed application. Stated simply, usability concerns the mechanics of the user experience while acceptability addresses the broader question of whether this technology will be useful in the real world. In this paper, usability is measured using the 10-item System Usability Scale (SUS) which employs a 5-point Likert-type scale [17], while acceptability is measured using a new instrument devised for this study which we have dubbed the Acceptability, Likely Effectiveness, Feasibility, and Appropriateness Questionnaire (ALFA-Q). The 10 items of the ALFA-Q were adapted from two different questionnaires from the literature on acceptability of intervention protocols; specifically, items were adapted from the Intervention Rating Profile for Teachers (IRP-15; [19, 20]) and the Treatment Acceptability Rating Form—Revised (TARF-R; [18, 21]). Although both the IRP-15 and TARF-R use a 6-point Likert-type scale, for consistency with the SUS, the ALFA-Q also employs a 5-point scale with rating labels equivalent to those of the SUS. Table 1 provides the details of the ALFA-Q.

Table 1. Acceptability, likely effectiveness, feasibility, and appropriateness questionnaire (ALFA-Q)

4.1 Participants

Eight volunteers provided feedback concerning elements of usability and acceptability with regards to the Autoscreen application. Because the primary goal of this study was to identify weaknesses of the application that could be corrected prior to later evaluation with the target cohort (i.e., toddlers and screener administrators), a small convenience sample of volunteers was recruited. This sample included two individuals with expertise in ASD diagnostic procedures and six individuals without such expertise, but with at least some awareness of common diagnostic tools such as ADOS-2 or STAT. All participants provided informed consent prior to engaging in study procedures, and these procedures were approved by the university’s Institutional Review Board.

4.2 Procedures

The Autoscreen application was loaded onto two ASUS ZenPad S 8.0 tablets running Android 6.0 (“Marshmallow”) and a shortcut to the application was placed on each tablet’s home screen. Participants were asked to use Autoscreen to walk through the procedures of a simulated assessment. As stated above, the focus of this initial study was to identify areas for improvement with regards to measures of usability and acceptability. As such, the application was not used to evaluate ASD risk among children at this stage. Once launched, Autoscreen automatically guides the user through the steps of task administration including entry of subject information (i.e., regarding the child being assessed), assessment activities, completing the internal scoring form, and reviewing the system-generated risk report. After completing the task procedures, participants provided anonymous responses on both the SUS and ALFA-Q while Autoscreen internally logged user inputs and event data. Once each of the participants had completed the task procedures, the researchers compiled all of the data for analysis.

4.3 Measures

The primary study measures for this work were the composite scores of the SUS and ALFA-Q. The SUS composite score is computed on a scale ranging from 0 (i.e., lowest rating of usability) to 100 (i.e., highest rating of usability). The ALFA-Q composite score also ranges from a minimum of 0 (i.e., lowest rating of acceptability) to a maximum of 100 (i.e., highest rating of acceptability), and is computed using Eq. (1)

$$ score = - 25 + \frac{5}{2}\sum\nolimits_{i = 1}^{10} {a_{i} } $$
(1)

where \( a_{i} \) is the value of the ith item of the ALFA-Q. Because the ALFA-Q score is introduced in this paper as an exploratory measure of acceptability, its interpretation should be considered cautiously within the context of this paper; future work is required to refine the ALFA-Q instrument. A secondary measure of interest was the task completion time because it provides an indication of the time demand of task procedures. Because a primary goal of Autoscreen is to make ASD risk screening both fast and convenient, task completion time should be as low as possible. Lastly, some participants also provided open-ended feedback, which is discussed in the following sections.
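Eq. (1) can be transcribed directly; the helper below (whose name is ours, for illustration) maps ten 5-point Likert responses onto the 0–100 composite scale.

```python
# Direct transcription of Eq. (1): score = -25 + (5/2) * sum(a_i),
# where each a_i is an item response in 1..5.
def alfa_q_score(items):
    """ALFA-Q composite score from ten item responses a_1..a_10."""
    assert len(items) == 10 and all(1 <= a <= 5 for a in items)
    return -25 + 2.5 * sum(items)

print(alfa_q_score([1] * 10))  # all-lowest responses  -> 0.0
print(alfa_q_score([5] * 10))  # all-highest responses -> 100.0
```

The affine mapping simply rescales the raw item sum (which ranges from 10 to 50) onto the 0–100 interval shared with the SUS composite.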

5 Results

Participant responses to survey items are given in Table 2. Responses were largely favorable on both the SUS (M = 87.19, SD = 12.28) and ALFA-Q (M = 85.94, SD = 13.02), although the variation in scores across participants suggests areas for improvement. An item-level analysis of the SUS revealed that the lowest rated item was item two, which describes the degree to which the system is “unnecessarily complex” [17]. The item with the poorest performance on the ALFA-Q was also item two, which describes the appropriateness of the application for individuals at a variety of positions on the autism spectrum, perhaps suggesting that Autoscreen—in its current form—may be best suited for comparatively higher-functioning individuals. With regards to task completion time, participants completed the entire set of administration tasks with a mean time of 8.72 min (SD = 4.16). Although promising, this figure represents only an approximate lower bound on administration time, as no child was present during the simulated assessment. Participants also provided open-ended feedback with regards to aesthetic elements of the application, including content layout, button sizes, and font selection. The data obtained in this preliminary work will be used to improve Autoscreen prior to its evaluation with children. Note, however, that the results observed in this preliminary study are based on use of the application outside of its ultimately intended use case (i.e., assessing the ASD risk of a child via the administration of a structured interaction). Again, the purpose of the current study was to evaluate the feasibility of the proposed system by measuring aspects of usability and acceptability.

Table 2. Participant responses on the SUS and ALFA-Q

6 Discussion and Conclusion

The feasibility of Autoscreen appears to be supported by the feedback received from participants. With regards to usability, the application was rated favorably with scores on the high end of the scale (i.e., 87.19). However, this score was obtained from users who were not actively administering study procedures with a child, and thus should be interpreted conservatively. Despite this limitation, however, Autoscreen can be improved using the information obtained on the SUS; specifically, noted user concerns about system complexity can now be addressed before the application is used in actual therapist-child evaluations. Similarly, ALFA-Q scores indicate favorability of the system (i.e., 85.94), and also bring to light areas in which Autoscreen may be improved or extended. For instance, broader applicability of Autoscreen across the autism spectrum might be achieved by constructing a dynamic set of assessment activities according to child ability, rather than using a one-size-fits-all approach as in the current iteration. The brevity of administration procedures (i.e., 8.72 min) is in line with our goal of delivering a 10–15 min interaction, but requires confirmation through evaluation with the target population. Additionally, open-ended feedback concerning specific UI elements—such as size, color, and style of font as well as location of buttons and text on the page—will be used to refine Autoscreen prior to evaluation with children. Future work with Autoscreen will include the evaluation of the system in a cohort of toddlers with and without concerns for ASD to gauge the accuracy of the predictive models.