Elsevier

Neurocomputing

Volume 147, 5 January 2015, Pages 60-70

Self-organization and missing values in SOM and GTM

https://doi.org/10.1016/j.neucom.2014.02.061

Abstract

In this paper, we study fundamental properties of the Self-Organizing Map (SOM) and the Generative Topographic Mapping (GTM): how the algorithms are affected by their initialization and how they behave in the presence of missing data. We show that the commonly used principal component analysis (PCA) initialization of the GTM does not guarantee good learning results with high-dimensional data. Initializing the GTM with the SOM is shown to improve self-organization on three high-dimensional data sets: the commonly used MNIST and ISOLET data sets and the epigenomic ENCODE data set. We also propose a revision of missing-data handling in the batch SOM algorithm, called the Imputation SOM, and show that the new algorithm is more robust in the presence of missing data. We benchmark the performance of the topographic mappings in the missing value imputation task and conclude that there are better methods for this particular task. Finally, we announce a revised version of the SOM Toolbox for Matlab with added GTM functionality.

Introduction

Topographic mappings, such as the Self-Organizing Map (SOM) [1], [2] and the Generative Topographic Mapping (GTM) [3], are useful tools for inspecting and visualizing high-dimensional data. The SOM was originally inspired by neuroscientific research on cortical organization, and the algorithm models the basic principles of the organization process at a general level. The SOM has been shown to serve its purpose well, especially when the faithfulness (precision) of the mapping from a high-dimensional space is considered [4]. In practice, the SOM has proved to be a robust approach tested in thousands of different applications [5], [6], [7]. The GTM was inspired by the SOM algorithm but operates in a probabilistic framework, which provides well-founded regularization and model comparison [3]. In this paper, we show that each method has strengths over the other and that the methods may even benefit each other. We investigate the applicability of the methods to high-dimensional, real-life data sets and provide methodological improvements for handling missing data.

Visualization of biological and life science data is an important task in the rapidly evolving field of bioinformatics. New kinds of measurement techniques and visualization methods appear at a constant pace (see, e.g., www.vizbi.org), but many practitioners still turn to rudimentary methods, such as hierarchical clustering and heatmaps. Recently, the SOM has been used to cluster genome segmentation regions based on different assay signal characteristics gathered in the Encyclopedia of DNA Elements (ENCODE) project [8], [9]. The SOM is particularly well suited for many visualization tasks on biological data because of its computational simplicity and relatively loose prior assumptions on the data. As we will show, the Gaussian noise model assumed in the GTM is a critical constraint for many high-dimensional data sets. Furthermore, a SOM-type mapping has also been adapted to arbitrary data for which mutual pairwise distances are defined [10], allowing one to compute SOMs based only on pairwise distance matrices. A comprehensive review of visualization methods for large data sets can be found, e.g., in [11].

Missing data are a common problem in many data-dependent fields ranging from social sciences to economics and from political research to the entertainment industry. In fields where conducting surveys or polls is commonplace, missing data occur, for instance, when people refuse to answer specific questions or when some people cannot be contacted. In the movie business, predicting customer preferences is literally a million dollar quest. The Netflix Prize (see, e.g., [12]) was an open competition to devise the best recommendation system to predict user ratings for films based on previous ratings. In the second part of this paper, we present a revision to the batch SOM algorithm, called the Imputation SOM, which is shown to improve the behavior of the SOM algorithm in the presence of missing data.

This paper is organized as follows. Sections 2 and 3 introduce the SOM and the GTM models, respectively. In Section 4, the properties of the models are compared in terms of self-organization and convergence. We show that using the SOM for initializing the GTM may improve the learning results in some cases. Section 5 explains the treatment of missing values in the GTM and applies the same principle to the SOM. The performance of the algorithms is compared in a missing value imputation task. Finally, the results and possible future work are discussed in Section 6.

In all the experiments, the SOM Toolbox [13] and Netlab [14] software packages are used. The GTM scripts in Netlab were revised to handle data with missing values, and a sequential training algorithm was contributed. In addition, an issue of small probabilities being rounded to zero due to insufficient floating point precision was solved. Finally, we announce a revised version of the SOM Toolbox which incorporates GTM functionality. An up-to-date version of the SOM Toolbox is available at

http://research.ics.aalto.fi/software/somtoolbox

Section snippets

Self-organizing map

The self-organizing map (SOM) [2] discovers underlying structure in data using K map units, also called prototypes or reference vectors, $\{m_i\}$. Explicit neighborhood relations are defined between the prototypes. The classical sequential SOM algorithm processes one data point $x(t)$ at a time. The Euclidean distance, or any other suitable distance measure, is used to find the best-matching unit $m_{c(x(t))}$, whose index is given by $c(x(t)) = \arg\min_i \lVert x(t) - m_i \rVert$. The reference vectors are then updated using the update rule $m_i(t+1) = m_i(t) + h$
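
The best-matching-unit search and prototype update described above can be sketched in Python. The update rule in the snippet is cut off after h, so this sketch assumes the standard Kohonen form m_i(t+1) = m_i(t) + h_ci(t) [x(t) - m_i(t)] with a Gaussian neighborhood on the map grid. The names `sequential_som_step`, `alpha` and `sigma` are illustrative and are not taken from the SOM Toolbox.

```python
import math


def best_matching_unit(x, prototypes):
    """Return the index c of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))


def sequential_som_step(x, prototypes, grid, alpha=0.5, sigma=1.0):
    """Apply one sequential SOM update for a single data point x.

    prototypes: list of reference vectors m_i in data space
    grid:       list of map coordinates r_i, one per prototype
    alpha:      learning rate (assumed; folded into the neighborhood h)
    sigma:      neighborhood width on the map grid (assumed Gaussian kernel)
    """
    c = best_matching_unit(x, prototypes)
    updated = []
    for m_i, r_i in zip(prototypes, grid):
        # Neighborhood strength decays with map-grid distance to the BMU.
        h = alpha * math.exp(-math.dist(r_i, grid[c]) ** 2 / (2.0 * sigma ** 2))
        # Move the prototype a fraction h of the way towards x.
        updated.append([m + h * (x_k - m) for m, x_k in zip(m_i, x)])
    return updated, c


# A one-dimensional map with three units embedded in a two-dimensional data space.
grid = [(0.0,), (1.0,), (2.0,)]
prototypes = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
x = [2.0, 3.0]
new_prototypes, bmu = sequential_som_step(x, prototypes, grid)
```

In this toy run the third unit is the best match and moves halfway towards the data point, while its map neighbors move by smaller amounts. In practice both `alpha` and `sigma` would be decreased over the course of training.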

Generative topographic mapping

The Generative Topographic Mapping (GTM) [3], [21] is a nonlinear latent variable model which was proposed as a probabilistic alternative to the SOM. Loosely speaking, it extends the SOM in much the same way as a Gaussian mixture model extends k-means clustering. This is achieved by working in a probabilistic framework where data vectors have posterior probabilities given a map unit. Hence, instead of possessing only one best-matching unit, each data vector contributes to many reference vectors
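
The soft assignment that replaces the single best-matching unit can be illustrated with a short Python sketch. It assumes the usual GTM setting of isotropic Gaussian components with a shared inverse variance `beta` and equal prior weights, and it subtracts the largest log-probability before exponentiating. That subtraction is one common remedy for small probabilities rounding to zero and is not necessarily the fix applied in the revised Netlab scripts. The function name `responsibilities` is hypothetical.

```python
import math


def responsibilities(x, centers, beta):
    """Posterior weight of every map unit for one data vector x.

    centers: images of the map units in data space (one vector per unit)
    beta:    shared inverse noise variance of the Gaussian components (assumed)
    """
    # Unnormalized log-probabilities; equal priors cancel in the normalization.
    log_p = [
        -0.5 * beta * sum((a - b) ** 2 for a, b in zip(x, y))
        for y in centers
    ]
    # Shift by the maximum so that the largest term is exp(0) = 1.
    top = max(log_p)
    weights = [math.exp(lp - top) for lp in log_p]
    total = sum(weights)
    return [w / total for w in weights]


centers = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
resp = responsibilities([0.5, 0.0], centers, beta=4.0)

# A point far from every center: a naive implementation would compute 0 / 0 here.
far = responsibilities([1000.0, 1000.0], centers, beta=4.0)
```

The first point lies midway between two centers and is shared equally between them. The second shows why the shift matters: every unshifted probability underflows to zero, yet the shifted version still normalizes correctly.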

Self-organization and convergence

Both the GTM and the batch SOM require careful initialization in order to self-organize [23], [24]. For both algorithms, the common choice is to initialize according to the plane spanned by the two main principal components of the data. In the batch SOM, the neighborhood is annealed during the learning which decreases the rigidness of the map. The most important advantages of the batch SOM when compared to the classical sequential SOM are quick convergence and computational simplicity [24].
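
As an illustration of the PCA initialization mentioned above, the following Python sketch places a rows-by-cols grid of prototypes on the plane spanned by the two leading principal components. It uses plain power iteration with deflation to stay dependency-free. The scaling of the grid to one standard deviation along each component and the function name `pca_grid_init` are assumptions made for this example, not the convention of the SOM Toolbox or Netlab.

```python
import math


def _covariance(data):
    """Return the mean vector and sample covariance matrix of the data."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    return mean, cov


def _top_eigenpair(matrix, iterations=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix (power iteration).

    The fixed start vector is assumed not to be orthogonal to the leading
    eigenvector; a random start would be safer in general.
    """
    d = len(matrix)
    start_norm = math.sqrt(sum((k + 1.0) ** 2 for k in range(d)))
    v = [(k + 1.0) / start_norm for k in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v


def pca_grid_init(data, rows, cols):
    """Prototypes on the plane of the two leading principal components."""
    mean, cov = _covariance(data)
    d = len(cov)
    lam1, v1 = _top_eigenpair(cov)
    # Deflation removes the first component so the second becomes dominant.
    deflated = [
        [cov[a][b] - lam1 * v1[a] * v1[b] for b in range(d)] for a in range(d)
    ]
    lam2, v2 = _top_eigenpair(deflated)
    s1, s2 = math.sqrt(max(lam1, 0.0)), math.sqrt(max(lam2, 0.0))
    prototypes = []
    for r in range(rows):
        for c in range(cols):
            # Grid coordinates scaled to [-1, 1] along each component.
            u = 2.0 * c / (cols - 1) - 1.0 if cols > 1 else 0.0
            t = 2.0 * r / (rows - 1) - 1.0 if rows > 1 else 0.0
            prototypes.append(
                [mean[j] + u * s1 * v1[j] + t * s2 * v2[j] for j in range(d)]
            )
    return prototypes


# Toy data lying in the plane z = 0, with more spread along x than along y.
data = [[float(x), 0.5 * y, 0.0] for x in range(-3, 4) for y in range(-2, 3)]
prototypes = pca_grid_init(data, rows=3, cols=4)
```

For the toy data the initial map lies entirely in the z = 0 plane and is stretched along the x axis, the direction of largest variance. For real use a library eigensolver would be the more robust choice.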

As

Missing values

In this section, we discuss the behavior of topographic mappings in the presence of missing values. We start by showing how missing values are treated in the GTM and develop the same idea for the SOM. The section is concluded by an experimental study where even low-dimensional data sets reveal differences between the studied algorithms.
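
To make the imputation task concrete, here is a minimal Python sketch of a conventional way to use a trained map on incomplete data: the best-matching unit is found using only the observed components, and the missing components are read off that unit's prototype. This is a generic masked-distance baseline given for illustration. It is not the Imputation SOM proposed in the paper, whose details are not part of this excerpt. Missing entries are marked with `None`, and the function names are hypothetical.

```python
def masked_sq_distance(x, m):
    """Squared Euclidean distance over the observed components of x only."""
    return sum((a - b) ** 2 for a, b in zip(x, m) if a is not None)


def impute_from_bmu(x, prototypes):
    """Fill the missing entries of x from its best-matching prototype.

    Returns the completed vector and the index of the best-matching unit.
    """
    c = min(
        range(len(prototypes)),
        key=lambda i: masked_sq_distance(x, prototypes[i]),
    )
    # Observed values are kept; only the gaps are taken from the prototype.
    filled = [p if a is None else a for a, p in zip(x, prototypes[c])]
    return filled, c


prototypes = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [4.0, 4.0, 4.0]]
filled, bmu = impute_from_bmu([0.9, None, 1.2], prototypes)
```

Here the second unit is the closest on the two observed components, so the missing middle value is filled with that prototype's value of 1.0.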

In all that follows, missing-at-random (MAR) data is assumed. This means that the probability of missingness is independent of missing values given the observed

Conclusions and discussion

In this paper, we have studied convergence properties of the SOM and the GTM and their behavior in the presence of missing data. We also showed that initializing the GTM with the SOM may be beneficial in some cases where the GTM with the conventional PCA initialization fails to fit the data. This was demonstrated using the ISOLET, MNIST and ENCODE data sets. The initialization seems to have very little effect with the wine data set, the data set with the lowest dimensionality used in our experiments.

We

Tommi Vatanen is a Ph.D. student at Aalto University, Finland. He received his M.Sc. in bioinformation technology in 2012 from Aalto University School of Electrical Engineering. He is also affiliated with the Broad Institute of MIT and Harvard in Cambridge, USA, where he is currently conducting research on the human gut microbiome as a visiting graduate student.

References (37)

  • M. Pöllä, T. Honkela, T. Kohonen, Bibliography of Self-Organizing Map (SOM) Papers: 2002–2005 Addendum, Technical...
  • An integrated encyclopedia of DNA elements in the human genome, Nature 489 (7414) (2012) 57–74. URL...
  • A. Mortazavi et al., Integrating and mining the chromatin landscape of cell-type specificity using self-organizing maps, Genome Res. (2013)
  • B. Hammer et al., How to visualize large data sets?
  • Y. Koren, The BellKor Solution to the Netflix Grand Prize,...
  • J. Vesanto, J. Himberg, E. Alhoniemi, J. Parhankangas, Self-organizing map in matlab: the SOM toolbox, in: The Matlab...
  • NETLAB: Algorithms for Pattern Recognition, Springer-Verlag New York, Inc., New York, NY, USA,...
  • P. Koikkalainen, E. Oja, Self-organizing hierarchical feature maps, in: IJCNN, vol. 2, 1990, pp....

    Maria Osmala received the M.Sc. degree in bioinformation technology from Aalto University School of Electrical Engineering, Finland, in 2011. She is currently a doctoral student in the Department of Information and Computer Science, Aalto University School of Science, Espoo, Finland. Her research interests include ChIP-seq data analysis in epigenetics studies, regulatory sequence prediction in the human genome and data fusion in bioinformatics and computational systems biology.

    Tapani Raiko received his D.Sc. degree in Computer Science in 2006 from Helsinki University of Technology. He is an Assistant Professor (tenure track) and an Academy Research Fellow at Aalto University School of Science. His research focus is deep learning.

    Krista Lagus is a senior researcher, group leader, and Ph.D. with experience in starting and leading research and new innovation-oriented activities. She has written over 70 scientific publications. She jointly developed two successful language technology methods and software applications, namely Websom and Morfessor. She started, planned and led the EIT ICT Labs Wellbeing Innovation Camp in 2010 and 2012.

    Marko Sysi-Aho is a principal scientist and team leader of the Biosystems modelling team at the Technical Research Centre of Finland (VTT). He completed his Ph.D. in Computational Sciences at Aalto University in 2005. His current research focus is on medical applications of biosystems modelling, mainly related to the development of methods for analysis and integration of metabolomics data with other data, including environmental and lifestyle factors.

    Matej Orešič holds a PhD in biophysics from Cornell University. Since 2014 he has been Principal Investigator at Steno Diabetes Center (Gentofte, Denmark), where he leads the Department of Systems Medicine. He is also an affiliated group leader at the Turku Centre for Biotechnology (Turku, Finland) and a principal investigator in the Academy of Finland Centre of Excellence in Molecular Systems Immunology and Physiology Research. His main research areas are metabolomics applications in biomedical research and integrative bioinformatics. He is particularly interested in the identification of disease vulnerabilities associated with different metabolic phenotypes and the underlying mechanisms linking these vulnerabilities with the development of specific disorders or their co-morbidities, with specific focus on obesity and diabetes and their co-morbidities. He has also initiated the MZmine open source project, which has led to popular software for metabolomics data processing. Prior to joining Steno Diabetes Center, Dr. Orešič was research professor at VTT Technical Research Centre of Finland (Espoo, Finland), head of computational biology and modeling at Beyond Genomics, Inc. (Waltham/MA) and bioinformatician at LION Bioscience Research in Cambridge/MA.

    Timo Honkela, Ph.D., is professor at the Department of Modern Languages at University of Helsinki. He has conducted research on several areas related to knowledge engineering, cognitive modeling and natural language processing. This includes a central role in the development of the Websom method for visual information retrieval and text mining based on the Kohonen self-organizing map algorithm. Honkela is a former long-term chairman of the Finnish Artificial Intelligence Society.

    Harri Lähdesmäki received the M.Sc. and D.Sc. degrees from Tampere University of Technology in 2001 and 2005, respectively. Between 09/2002 and 03/2003, he was a visiting researcher at the Cancer Genomics Laboratory, The University of Texas M.D. Anderson Cancer Center, and from 2005 to 2007 he worked as a postdoctoral fellow at the Institute for Systems Biology, Seattle, WA, USA. From 2007 to 2008, he worked as an Assistant Professor in the Department of Signal Processing at Tampere University of Technology, Finland, followed by a pro term Professor position at Helsinki University of Technology until 2012. In autumn 2012 he was appointed to Assistant Professor (tenure track) and Academy Research Fellow positions in the Department of Information and Computer Science at Aalto University School of Science (formerly known as Helsinki University of Technology). He is also an affiliated group leader at the Turku Center for Biotechnology, University of Turku. His research interests include computational and systems biology, regulatory genomics, statistical modeling and machine learning, with applications to immunology, stem cells and T1D.
