1 Introduction

Autism Spectrum Disorders (ASD) are developmental disorders with increasing prevalence and substantial social impact. Significant effort is spent on early diagnosis, which is critical for proper treatment. ASD is also highly heterogeneous, making diagnosis especially difficult. Today, identifying ASD requires a battery of cognitive tests and hours of clinical evaluation, in which participants are extensively tested and their behavioral patterns (e.g., their social engagement with others) are observed. Computer-assisted technologies for identifying ASD are thus an important goal, with the potential to decrease diagnostic costs and increase standardization.

In this work, we focus on Fragile X Syndrome (FXS). FXS is the most common known genetic cause of autism [5], affecting approximately 100,000 people in the United States. Individuals with FXS exhibit a set of developmental and cognitive deficits including impairments in executive functioning, visual memory and perception, social avoidance, communication impairments and repetitive behaviors [14]. In particular, as in ASD more generally, eye-gaze avoidance during social interactions with others is a salient behavioral feature of individuals with FXS. FXS is an important case study for ASD because it can be diagnosed easily as a single-gene mutation. For our purposes, the focus on FXS means that ground-truth diagnoses are available and heterogeneity of symptoms in the affected group is reduced.

Maintaining appropriate social gaze is critical for language development, emotion recognition, social engagement, and general learning through shared attention [3]. Previous studies [4, 10] suggest that gaze fluctuations play an important role in characterizing individuals on the autism spectrum. In this work, we study the underlying patterns of visual fixations during dyadic interactions and, in particular, use those patterns to characterize different developmental disorders.

We address two problems. The first challenge is to build new features that characterize fine-grained behaviors of participants with developmental disorders. We do this by exploiting computer vision and multi-modal data to capture detailed visual fixations during dyadic interactions. The second challenge is to use these features to build a system capable of discriminating between developmental disorders. The remainder of the paper is structured as follows: In Sect. 2, we discuss prior work. In Sect. 3, we describe the raw data, its collection, and the sensors used. In Sect. 4, we describe the constructed features and analyze them. In Sect. 5, we describe our classification techniques. In Sect. 6, we describe the experiments and results. In Sect. 7, we discuss the results.

2 Previous Work

Pioneering work by Rehg et al. [12] shows the potential of using coarse gaze information to measure relevant behavior in children with ASD. However, that work does not address automated fine-grained classification between ASD and other disorders. Our work extends this line of research to develop a means of disorder classification from multi-modal data. Some previous efforts in the classification of developmental disorders such as epilepsy and schizophrenia have relied on electroencephalogram (EEG) recordings [11]. These methods are accurate, but they require long recording times, and the use of EEG probes positioned over a participant's scalp and face can limit their applicability to developmental populations. Meanwhile, eye-tracking has long been used to study autism [1, 7], but we are not aware of an automated system for inter-disorder assessment using eye-tracking such as the one proposed here.

3 Dataset

Our dataset consists of 70 videos of a clinician interviewing a participant, overlaid with the participant's point of gaze (as measured by a remote eye-tracker), first reported in [6].

The participants were diagnosed with either an idiopathic developmental disorder (DD) or Fragile X Syndrome (FXS). Participants with DD displayed levels of autistic symptoms similar to participants with FXS, but did not have a diagnosis of FXS or any other known genetic syndrome. Because there are known gender-related behavioral differences among individuals with FXS, we further subdivided this group by gender into males (FXS-M) and females (FXS-F). There were no gender-related behavioral differences in the DD group, and genetic testing confirmed that DD participants did not have FXS.

Participants were between 12 and 28 years old, with 51 FXS participants (32 male, 19 female) and 19 DD participants. The two groups were well-matched on chronological and developmental age, and had similar mean scores on the Vineland Adaptive Behavior Scales (VABS), a well-established measure of developmental functioning. The average score was 58.5 (\(SD = 23.47\)) for individuals with FXS and 57.7 (\(SD= 16.78\)) for controls, indicating that the level of cognitive functioning in both groups was 2–3 SDs below the typical mean.

Fig. 1. (a) We study social interactions between a participant with a mental impairment and an interviewer, using multi-modal data from a remote eye-tracker and a camera. The goal of the system is fine-grained classification of developmental disorders from these data. (b) A frame from the videos showing the participant's view (the participant's head is visible at the bottom of the frame). Eye movements were tracked with a remote eye-tracker and mapped into the coordinate space of this video.

Participants were each interviewed by a clinically trained experimenter. In our setup, the camera was placed behind the participant, facing the interviewer. Figure 1 depicts the configuration of the interview and of the physical environment. Eye movements were recorded using a Tobii X120 remote corneal-reflection eye-tracker, with time-synchronized input from the scene camera. The eye-tracker was spatially calibrated to the remote camera by having the participant look at a known set of locations prior to the interview.

4 Visual Fixation Features

A goal of our work is to design features that simultaneously provide insight into these disorders and allow for accurate classification between them. These features are the building blocks of our system, and the key challenge is engineering them to distill the most meaningful information from the raw eye-tracker and video footage. We capture the participant's point of gaze and its distribution over the interviewer's face 5 times per second throughout the interview. There are 6 relevant regions of interest: nose, left eye, right eye, mouth, jaw, and outside the face. The precise detection of these fine-grained features enables us to study small changes in participants' fixations at scale.

Fig. 2. Temporal analysis of attention to the face. The X axis represents time in frames (in increments of 0.2 s); the Y axis represents each participant. Black dots mark time points when the participant was looking at the interviewer's face; white space indicates that they were not.

Fig. 3. Histograms of visual fixations for the various disorders. The X axis represents fixation regions, from left to right: nose (1), eye-left (2), eye-right (3), mouth (4), and jaw (5). The histograms are computed from the data of all participants. Non-face fixations are omitted for visualization convenience.

For each video frame, we detected a set of 69 landmarks on the interviewer's face using a part-based model [16]. Figure 1 shows examples of landmark detections. In total, we processed 14,414,790 landmarks, covering 59 K, 56 K, and 156 K frames for the DD, FXS-F, and FXS-M groups, respectively. We evaluated a sample of 1 K randomly selected frames, of which only a single frame was incorrectly annotated. We mapped the eye-tracking coordinates to the facial-landmark coordinates with a linear transformation. Each feature takes the label of the cluster (e.g., jaw) containing the landmark closest to the participant's point of gaze. We next present some descriptive analyses of these data.
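In practice, this feature extraction reduces to a nearest-landmark lookup per gaze sample. Below is a minimal sketch under our own assumptions: the landmarks are already detected and grouped by region, the gaze point has already been mapped into video-frame coordinates, and the distance threshold used to assign the non-face label is illustrative rather than the exact criterion of our pipeline.

```python
import numpy as np

def region_label(gaze_xy, landmarks_xy, landmark_regions, face_radius=200.0):
    """Assign a gaze sample to the region of the nearest facial landmark.

    gaze_xy          -- (2,) gaze coordinates in the video frame
    landmarks_xy     -- (N, 2) detected landmark coordinates in the same frame
    landmark_regions -- length-N list of region names, one per landmark
                        (e.g. 'nose', 'eye-left', 'eye-right', 'mouth', 'jaw')
    face_radius      -- illustrative distance (pixels) beyond which the
                        sample is labeled 'non-face'
    """
    dists = np.linalg.norm(landmarks_xy - np.asarray(gaze_xy), axis=1)
    nearest = int(np.argmin(dists))
    if dists[nearest] > face_radius:
        return "non-face"
    return landmark_regions[nearest]
```

Applying such a function to every gaze sample (5 per second) yields the label sequences analyzed below.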

Feature granularity. We first analyze the relevance of our fine-grained attention features. Participants, especially those with FXS, spent only a fraction of the time looking at the interviewer's face. Analyzing the time series of when individuals glance at the face of their interviewer (see Fig. 2), we observe high variance among participants within each group. For example, most individual FXS-F sequences could easily be confused with sequences from the other groups.

Clinicians often express the opinion that the distribution of fixations, and not just the sheer lack of face fixations, is related to the general autism phenotype [8, 10]. This opinion is supported by the distributions in Fig. 3: DD and FXS-F are quite similar, whereas FXS-M is distinct, focusing primarily on the mouth (4) and nose (1) regions.

Attentional transitions. In addition to the distribution of fixations, clinicians also believe that the sequence of fixations reflects underlying behavior. In particular, FXS participants often glance at the face quickly and then look away, or scan between non-eye regions. Figure 4 shows region-to-region transitions as a heatmap. There is a marked difference between the disorders: individuals with DD make more transitions, while those with FXS make significantly fewer, congruent with the clinical intuition. The transitions between facial regions identify the three groups better than the transitions from non-face to face regions. FXS-M participants tend to swap their gaze quite frequently between mouth and nose, while the other two groups do not. DD participants exhibit much more movement between facial regions, without any clear preference. FXS-F patterns resemble those of DD, though less pronounced.

Fig. 4. Matrix of attentional transitions for each disorder. Each square [i, j] represents the aggregated number of times participants of each group transitioned attention from state i to state j. The axes represent the different states: non-face (0), nose (1), eye-left (2), eye-right (3), mouth (4), and jaw (5).
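For concreteness, the aggregated transition counts of Fig. 4 can be computed directly from a participant's fixation-label sequence; the following is a minimal sketch in which the state ordering matches the figure's axes, while the function name and data layout are our own.

```python
import numpy as np

STATES = ["non-face", "nose", "eye-left", "eye-right", "mouth", "jaw"]
STATE_INDEX = {s: i for i, s in enumerate(STATES)}

def transition_counts(label_sequence):
    """Count attentional transitions between consecutive fixation labels.

    Entry [i, j] is the number of times attention moved from state i
    to state j within the given sequence.
    """
    counts = np.zeros((len(STATES), len(STATES)), dtype=int)
    for prev, curr in zip(label_sequence, label_sequence[1:]):
        counts[STATE_INDEX[prev], STATE_INDEX[curr]] += 1
    return counts

# Group-level matrices (as in Fig. 4) are obtained by summing the
# per-participant matrices of each group.
```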

Fig. 5. (a)–(c) Approximate Entropy (ApEn) of the data per individual as the window length parameter w varies. The Y axis is ApEn and the X axis is w. Each line represents one participant's data. We observe great variance among individuals.

Approximate Entropy. We next use Approximate Entropy (ApEn) [13] to measure how predictable a sequence is; a lower entropy value indicates a higher degree of regularity in the signal. For each group (DD, FXS-F, FXS-M), we selected the sequences of 15 randomly chosen participants and computed ApEn while varying the sliding-window length w. Figure 5 depicts this analysis. There is great variance among individuals within each population, with many sharing similar entropy values with participants of other groups. This high variability makes the sequences harder to classify.
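For reference, the sketch below is a compact implementation of standard ApEn following Pincus's formulation [13]; here the window length m plays the role of the parameter w varied in Fig. 5, while the tolerance r and the numeric encoding of the fixation labels are our assumptions.

```python
import numpy as np

def approximate_entropy(series, m, r):
    """Approximate Entropy of a 1-D sequence (lower = more regular).

    series -- 1-D array of values (e.g. fixation labels encoded as integers)
    m      -- window (embedding) length, the parameter varied in Fig. 5
    r      -- tolerance within which two windows count as similar
    """
    x = np.asarray(series, dtype=float)

    def phi(k):
        n = len(x) - k + 1
        windows = np.array([x[i:i + k] for i in range(n)])
        # Chebyshev distance between every pair of windows.
        dists = np.max(np.abs(windows[:, None, :] - windows[None, :, :]), axis=2)
        # Fraction of windows within tolerance r of each window (self-matches included).
        counts = np.sum(dists <= r, axis=1) / n
        return np.mean(np.log(counts))

    return phi(m) - phi(m + 1)
```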

5 Classifiers

The goal of this work is to create an end-to-end system for classification of developmental disorders from raw visual information. So far we have introduced features that capture social attentional information and analyzed their temporal structure. We next need to construct methods capable of utilizing these features to predict the specific disorder of the patient.

Model (RNN). The Recurrent Neural Network (RNN) is a generalization of feedforward neural networks to sequences. Our deep learning model is an adaptation of the attention-enhanced RNN architecture (LSTM+A) proposed in [15]. That model has produced impressive results in other domains such as language modeling and speech processing, and our feature sequences fit its data profile. In addition, an encoder-decoder RNN architecture allows us to experiment with sequences of varying lengths in a cost-effective manner. Our models differ from LSTM+A in two ways. First, we replaced the LSTM cells with GRU cells [2], which are memory-efficient and could provide a better fit to our data [9]. Second, our decoder produces a single output value (i.e., the class). The decoder is a single-unit multi-layered RNN (without unfolding) with a soft-max output layer. Conceptually it could be seen as a many-to-one RNN, but we present it as a configuration of [15] given their similarity and our adoption of the attention mechanism.

For our experiments, we used 3 RNN configurations: RNN_128 (3 layers of 128 units), RNN_256 (3 layers of 256 units), and RNN_512 (2 layers of 512 units). These parameters were selected to fit within our GPU memory constraints.

We trained each model for a total of 1000 epochs, using mini-batches of sequences and SGD with momentum and gradient-norm clipping at 0.5.
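To make the configuration concrete, the following is a minimal PyTorch sketch of a many-to-one GRU classifier with a simple attention readout over the encoder states, sized like RNN_128. It is a sketch under our assumptions (the embedding of fixation labels, the particular attention form, and the hyperparameters shown are illustrative), not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

class GazeRNNClassifier(nn.Module):
    """GRU encoder over fixation-label sequences with an attention readout
    and a soft-max (logit) output over the two diagnostic classes."""

    def __init__(self, n_regions=6, hidden=128, layers=3, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(n_regions, hidden)       # fixation label -> vector
        self.encoder = nn.GRU(hidden, hidden, layers, batch_first=True)
        self.attn = nn.Linear(hidden, 1)                   # scalar score per time step
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, labels):                             # labels: (batch, time) ints
        h = self.embed(labels)                             # (batch, time, hidden)
        enc, _ = self.encoder(h)                           # (batch, time, hidden)
        weights = torch.softmax(self.attn(enc).squeeze(-1), dim=1)
        context = (weights.unsqueeze(-1) * enc).sum(dim=1) # attention-weighted summary
        return self.out(context)                           # logits for CrossEntropyLoss

model = GazeRNNClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Inside the training loop, gradients would be clipped to a max norm of 0.5:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
```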

Other Classifiers. We also trained shallow baseline classifiers. We engineered a convolutional neural network (CNN) that can exploit the local temporal structure of our data. It is composed of one hidden layer of 6 convolutional units followed by point-wise sigmoid nonlinearities; the feature vectors computed across the units are concatenated and fed to an output layer consisting of an affine transformation followed by another sigmoid. We also trained support vector machines (SVMs), Naive Bayes (NB) classifiers, and Hidden Markov Models (HMMs).
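A minimal sketch of such a baseline is shown below, using a 1-D convolution over one-hot-encoded fixation labels; the kernel width and the sequence length (250 samples, i.e. 50 s at the 5 Hz feature rate) are illustrative assumptions rather than the exact values used.

```python
import torch
import torch.nn as nn

class GazeCNNBaseline(nn.Module):
    """One hidden layer of 6 convolutional units with point-wise sigmoids,
    concatenated and fed to an affine + sigmoid output layer."""

    def __init__(self, n_regions=6, seq_len=250, kernel=5):
        super().__init__()
        self.conv = nn.Conv1d(n_regions, 6, kernel_size=kernel)
        self.out = nn.Linear(6 * (seq_len - kernel + 1), 1)

    def forward(self, onehot_seq):                  # (batch, n_regions, seq_len)
        h = torch.sigmoid(self.conv(onehot_seq))    # (batch, 6, seq_len - kernel + 1)
        h = h.flatten(start_dim=1)                  # concatenate unit responses
        return torch.sigmoid(self.out(h))           # probability of one class
```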

6 Experiments and Results

By varying the classification methods described in Sect. 5, we perform a quantitative evaluation of the overall system. We assume the gender of the participant is known and select the clinically relevant pairwise classification experiments DD vs. FXS-F and DD vs. FXS-M, using 32 FXS-M, 19 FXS-F, and 19 DD participants. We classify the developmental disorder of each participant, given their individual time-series feature data p, and evaluate the precision of our system. We create 80 %/20 % training/testing splits such that no participant's data is shared between the two sets, randomly shuffling the participants of each class to maintain an equal (50 %/50 %) distribution of the two classes over \(S_{train}\) and \(S_{test}\). For each experiment, we performed 10-fold cross-validation, where each fold is defined by a new random 80/20 split of the participants, so that the averaged results represent the entire set of participants; about 80 participants were tested per experiment.
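A sketch of this participant-level splitting procedure is given below, assuming per-class lists of participant identifiers; subsampling the larger class to equalize the two classes is our reading of the balancing step and may differ in detail from the exact procedure used.

```python
import random

def balanced_participant_split(ids_by_class, test_frac=0.2, seed=0):
    """80/20 participant-level split with equally many participants per class,
    guaranteeing that no participant's data appears in both sets.

    ids_by_class -- e.g. {"DD": [...], "FXS-M": [...]}
    """
    rng = random.Random(seed)
    n = min(len(ids) for ids in ids_by_class.values())  # balance the two classes
    train, test = [], []
    for ids in ids_by_class.values():
        chosen = rng.sample(ids, n)                     # random balanced subset
        cut = int(round(n * test_frac))
        test.extend(chosen[:cut])
        train.extend(chosen[cut:])
    return train, test
```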

Table 1. Comparison of the precision of our system against other classifiers. Columns report pairwise classification precision for the DD vs. FXS-F and DD vs. FXS-M binary tasks. Classifiers are run on 3, 10, and 50 second time windows. We compare our RNN models against the CNN, SVM, NB, and HMM baselines.

Metric. We consider the binary classification of an unknown participant as having DD or FXS. We adopt a voting strategy where, given a participant's data \(p=[f_1, f_2, \ldots, f_{T}]\), we classify all sub-sequences s of p of fixed length w using a sliding-window approach. In our experiments, w corresponds to 3, 10, or 50 seconds of video footage. To predict the participant's disorder, we employ a max-voting scheme over the classes. The predicted class C of the participant is given by:

\[ C = \mathop{\arg\max}_{C_i \in \{C_1, C_2\}} \big|\{\, s \subset p : \text{Class}(s) = C_i \,\}\big| \qquad (1) \]

where \(C_1, C_2 \in \{\text {DD}, \text {FXS-F}, \text {FXS-M}\}\) are the two classes under consideration and \(\text {Class}(s)\) is the output of a classifier given input s. We use 10 cross-validation folds to compute the average classification precision.
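Equation (1) amounts to the following voting procedure over sliding windows; in this minimal sketch, `classifier` stands for any of the trained models, and the stride and the conversion from seconds to samples (at the 5 Hz feature rate of Sect. 4) are made explicit for illustration.

```python
from collections import Counter

def classify_participant(sequence, classifier, window_seconds, rate_hz=5, stride=1):
    """Max-voting over fixed-length sub-sequences of a participant's data.

    sequence       -- the participant's full fixation-label sequence p
    classifier     -- maps a length-w sub-sequence s to a predicted class
    window_seconds -- window length in seconds (3, 10, or 50 in our experiments)
    """
    w = window_seconds * rate_hz
    votes = Counter(
        classifier(sequence[i:i + w])
        for i in range(0, len(sequence) - w + 1, stride)
    )
    return votes.most_common(1)[0][0]  # class receiving the most votes
```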

Results. The results are reported in Table 1. The highest average precision is attained by the RNN_512 model with a 50 second time window: it classifies DD versus FXS-F with 0.86 precision and DD versus FXS-M with 0.91 precision. We suspect that the strong results of RNN_512 are related to its high capacity and its ability to represent complex temporal structure.

7 Conclusion

We have demonstrated the use of computer vision and machine learning techniques in a cost-effective system for the assistive diagnosis of developmental disorders that have a visual phenotypic expression in social interactions. Data from experimenters interviewing participants with developmental disorders were collected using video and a remote eye-tracker. We built visual features corresponding to fine-grained attentional fixations and developed classification models that use these features to discriminate between FXS and idiopathic developmental disorder. Despite the high degree of variance and noise in the signals, our high accuracies imply the existence of temporal structure in the data.

This work serves as a proof of concept of the power of modern computer vision systems for assistive diagnosis of developmental disorders. We are able to provide a high-probability prediction of a specific developmental diagnosis based on a short eye-movement recording. This system, and others like it, could enable remarkably faster screening of individuals. Future work will consider extending this capability to a greater range of disorders and improving classification accuracy.