Elsevier

Neurocomputing

Volume 21, Issues 1–3, 6 November 1998, Pages 61-77

Text classification with self-organizing maps: Some lessons learned

https://doi.org/10.1016/S0925-2312(98)00032-0

Abstract

The self-organizing map has already found appreciation for document classification in the information retrieval community. The map display is a highly effective and intuitive metaphor for orientation in the information space established by a document collection. In this paper we discuss ways of using self-organizing maps for document classification. Furthermore, we argue in favor of paying more attention to the fact that document collections lend themselves naturally to a hierarchical structure defined by the subject matter of the documents. We take advantage of this fact by using a hierarchically organized neural network, built up from a number of independent self-organizing maps, in order to establish a true document taxonomy. As a highly convenient side effect of using such an architecture, the time needed for training is reduced substantially and the user is provided with an even more intuitive metaphor for visualization. Since the individual layers of self-organizing maps represent different aspects of the document collection at different levels of detail, the neural network shows the document collection in a form comparable to an atlas, where the user may easily select the most appropriate degree of granularity depending on the actual focus of interest during the exploration of the document collection.

Introduction

During recent years we have witnessed an ever-increasing flood of miscellaneous written information available in computer-accessible form, culminating in the advent of massive digital libraries. Powerful methods for organizing, exploring, and searching collections of text documents are thus needed to deal with that mass of information. Classical methods developed by the information retrieval community for searching documents are based on keywords assigned either manually or automatically by indexing the full text of the various documents. These methods may be enhanced with proximity search functionality and keyword combination according to Boole’s algebra. Other widely used approaches instead rely on document similarity measures based on a vector-space representation of the various texts. Still missing, however, are tools providing assistance for explorative search in document collections. Explorative search may be characterized as the struggle to uncover useful information when the user is unaware of appropriate keywords which could guide the search process towards relevant information. The reason for the existence of such a situation is twofold. Firstly, the user often has only limited insight into what is actually contained in the text collection and thus has just vague expectations of what might be found. Secondly, a usually convenient characteristic of natural language, namely that the same fact of reality may be described in a number of different ways, turns out to be a hindrance in locating relevant information because the same piece of information may be represented by different sets of keywords. This is often referred to as the vocabulary problem in the information retrieval literature [4].

Exploration of document archives may be supported by organizing the documents into taxonomies or hierarchies according to their subject matter. We note in passing that librarians have used such an organization for centuries. A number of approaches are applicable to arrive at such a classification. Among the oldest and most widely used ones we certainly have to mention statistics, especially cluster analysis. The usage of hierarchical cluster analysis for document classification has a long tradition in information retrieval research, and its specific strengths and weaknesses are well explored [32, 37].

The renaissance of widespread interest in artificial neural network models, which commenced more than a decade ago, is at least partly due to the availability of massive computer power at reasonable prices and the invention of a broad spectrum of highly effective learning rules. As a consequence, artificial neural networks are widely used to uncover structure in a large variety of actual input data. High-dimensional input data are especially challenging in this context. Text document classification is an almost perfect example, since the various documents are, by nature, described in a high-dimensional feature space spanned by the keywords extracted from the documents.

In this paper we describe an approach to text classification in which the self-organizing map performs the classification task. In order to use the self-organizing map for text classification, each document is represented as a histogram of word occurrences. The paper is organized as follows. In Section 2 we provide a brief description of the architecture and learning rule of self-organizing maps. Section 3 is dedicated to an exposition of some issues we believe to be crucial for document classification. In Section 4 the sample document collection used for our experiments is introduced. The results from the classification process are presented in Section 5. In particular, we give examples of document classification by using the basic model of self-organizing maps as well as a hierarchy of self-organizing maps in order to provide the user with a taxonomy of documents. In Section 6 we give a brief review of the ever-increasing number of reports on document classification by using self-organizing maps and related models adhering to the unsupervised learning paradigm. Finally, we present some conclusions in Section 7.

Section snippets

Self-organizing maps

The self-organizing map [12, 13] is a general unsupervised tool for ordering high-dimensional data in such a way that similar input patterns are grouped spatially close to one another. It consists of a layer of input units that receive input patterns and propagate them as they are to a set of output units. These output units are arranged according to some topology where the most common choice is a two-dimensional grid. Each of the output units i is assigned a weight vector m_i.

During each learning
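A single learning step of a self-organizing map can be sketched as follows. This is a minimal illustration of the standard textbook formulation (winner selection by smallest Euclidean distance, followed by an update of the winner and its grid neighbours towards the input pattern), not necessarily the exact parametrization used in the paper. The function names, the Gaussian neighbourhood function, and the linearly decaying schedules for `alpha` and `sigma` are assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position (row, col) of the unit whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for pos, m in weights.items():
        dist = sum((xi - mi) ** 2 for xi, mi in zip(x, m))  # squared Euclidean distance
        if dist < best_dist:
            best, best_dist = pos, dist
    return best


def train_step(weights, x, alpha, sigma):
    """One learning step: move the winner and its grid neighbours towards x."""
    winner = best_matching_unit(weights, x)
    for pos in list(weights):
        grid_dist_sq = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
        # Assumed Gaussian neighbourhood: 1.0 at the winner, smaller further away on the grid.
        h = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))
        m = weights[pos]
        weights[pos] = [mi + alpha * h * (xi - mi) for xi, mi in zip(x, m)]
    return winner


def train(patterns, rows, cols, steps=2000, alpha0=0.5, seed=0):
    """Train a rows x cols map; the decay schedules below are illustrative assumptions."""
    rng = random.Random(seed)
    dim = len(patterns[0])
    sigma0 = max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(steps):
        frac = 1.0 - t / steps               # goes from 1.0 towards 0.0
        alpha = alpha0 * frac                # learning rate shrinks over time
        sigma = max(sigma0 * frac, 0.5)      # neighbourhood radius shrinks over time
        train_step(weights, rng.choice(patterns), alpha, sigma)
    return weights


if __name__ == "__main__":
    toy_patterns = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]]
    som = train(toy_patterns, rows=3, cols=3)
    for p in toy_patterns:
        print(p, "->", best_matching_unit(som, p))
```

Because neighbouring units are pulled towards the same inputs, units that are close on the grid end up with similar weight vectors, which is what produces the spatial grouping of similar patterns described above.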

Issues in document classification

Generally, the task of text classification aims at uncovering the semantic similarities between various documents. In a first step, the documents have to be mapped onto some representation language in order to enable further analyses. This process is termed indexing in the information retrieval literature. A number of different strategies have been suggested over the years of information retrieval research. Still one of the most common representation techniques is single term full text indexing
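As a rough illustration of single term full text indexing, the sketch below turns each document into a histogram of word occurrences over a shared vocabulary. The tokenizer (lowercase alphabetic runs), the helper names, and the two toy texts are assumptions made for the example; they are not taken from the paper or from the actual manual pages.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic terms (a deliberately simple assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Collect every distinct term in the collection, in a fixed (sorted) order."""
    terms = set()
    for text in documents.values():
        terms.update(tokenize(text))
    return sorted(terms)


def index_document(text, vocabulary):
    """Represent one document as a vector of term counts aligned with the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    # Hypothetical toy texts standing in for real documents.
    documents = {
        "Set": "a set is an unordered collection of objects",
        "Dictionary": "a dictionary is a set of key value associations",
    }
    vocabulary = build_vocabulary(documents)
    for name, text in documents.items():
        print(name, index_document(text, vocabulary))
```

The resulting vectors have one component per term in the collection, which is why the feature space becomes high-dimensional even for modest document collections.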

A sample document collection

In order to present the effect of document classification in a uniform fashion, we will use the manual pages of a C++ class library as a reference document collection. In particular, we use the NIHCL [5], a class library developed at the National Institutes of Health, which consists of classes implementing basic data types such as Float or Date, classes implementing data structures such as Dictionary or Set, and a number of classes implementing file I/O functionality like OIOin or OIOout

A map of documents

A typical result from the application of self-organizing maps to this kind of data is presented in Fig. 2. In this case we used a 10×10 grid of units to represent the document collection. The graphical representation of the classification is straightforward in that each unit is represented either by means of a class name, in case the respective unit is the winner for that specific input pattern, or by means of a dot, in case the unit has not won the competition for any input pattern.

For the
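The map display described above (a class name on each unit that won the competition for an input pattern, a dot on every other unit) can be sketched as a small text renderer. The function name, the comma-separated handling of several documents on one unit, and the winner positions in the usage example are assumptions for illustration only; they do not reproduce the figure from the paper.

```python
def render_map(rows, cols, winners):
    """Render a rows x cols grid: class names on winning units, '.' elsewhere.

    `winners` maps a document (class) name to the (row, col) of its winning unit.
    """
    cells = {}
    for name, pos in winners.items():
        cells.setdefault(pos, []).append(name)
    # Units that won for several documents list all names, separated by commas (assumption).
    labels = [[",".join(cells.get((r, c), ["."])) for c in range(cols)]
              for r in range(rows)]
    width = max(len(label) for row in labels for label in row)
    lines = [" ".join(label.ljust(width) for label in row).rstrip() for row in labels]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical winner positions on a small 3 x 4 grid.
    winners = {"Float": (0, 0), "Date": (0, 3), "Set": (2, 1), "Dictionary": (2, 2)}
    print(render_map(3, 4, winners))
```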

Related work

Text classification with self-organizing maps has already produced an impressive record of publications. The work of Ref. [17] perhaps marks the first attempt to use the self-organizing map in an information retrieval setting. In this work the authors report the results from classifying a number of technical documents based on keywords extracted from the various titles. In total, the document representation is made up of 25 manually selected index terms and is thus hardly realistic. The

Conclusions

In this paper we provided an account of the capabilities of self-organizing maps in performing a highly important task of information retrieval, namely text classification. As an experimental document collection we used the manual pages describing the various C++ classes contained in a real-world class library. This environment was chosen because the inherent structure of the document collection is known and thus available for evaluation of the classification results. We used a document

Acknowledgements

The author is grateful to the anonymous referees who provided lots of insightful comments on an earlier version of this paper.

References (38)

  • J. Blackmore, R. Miikkulainen, Incremental grid growing: encoding high-dimensional structure into a two-dimensional...
  • S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing by latent semantic analysis, J. Amer. Soc....
  • B. Fritzke, Growing cell structures: a self-organizing network for unsupervised and supervised learning, Neural...
  • G.W. Furnas, T.K. Landauer, L.M. Gomez, S.T. Dumais, The vocabulary problem in human-system communications, Comm. ACM...
  • K.E. Gorlen, NIH Class Library Reference Manual, National Institutes of Health, Bethesda, MD,...
  • K.E. Gorlen, S. Orlow, P. Plexico, Abstraction and Object-Oriented Programming in C++, Wiley, New York,...
  • T. Honkela, Self-organizing maps of words for natural language processing applications, Proc. Int. ICSC Symp. on Soft...
  • T. Honkela, S. Kaski, K. Lagus, T. Kohonen, Newsgroup exploration with WEBSOM method and browsing interface, Helsinki...
  • T. Honkela, S. Kaski, K. Lagus, T. Kohonen, WEBSOM – self-organizing maps of document collections, Proc. Workshop on...
  • T. Honkela, V. Pulkki, T. Kohonen, Contextual relations of words in Grimm tales analyzed by self-organizing maps, Proc....
  • M. Köhle, D. Merkl, Visualizing similarities in high dimensional input spaces with a growing and splitting neural...
  • T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybernet. 43...
  • T. Kohonen, Self-organizing Maps (1995)
  • T. Kohonen, S. Kaski, Automatic coloring of data according to its cluster structure, in: E. Alhoniemi, J. Iivarinen, L....
  • T. Kohonen, S. Kaski, K. Lagus, T. Honkela, Very large two-level SOM for the browsing of newsgroups, Proc. Int. Conf....
  • K. Lagus, T. Honkela, S. Kaski, T. Kohonen, Self-organizing maps of document collections: a new approach to interactive...
  • X. Lin, D. Soergel, G. Marchionini, A self-organizing semantic map for information retrieval, Proc. Int. ACM SIGIR...
  • D. Merkl, A connectionist view on document classification, Proc. Australasian Database Conf., Adelaide, Australia,...
  • D. Merkl, Content-based document classification with highly compressed input data, Proc. Int. Conf. on Artificial...

Dieter Merkl received his diploma and doctoral degree in social and economic sciences from the University of Vienna, Austria, in 1989 and 1995, respectively. From 1990 to 1994 he held a research position at the University of Vienna. Since 1995 he has been a member of the academic faculty at Vienna University of Technology. During 1997 he was a visiting research fellow with the Department of Computer Science, Royal Melbourne Institute of Technology. He has published over 50 articles in refereed journals and international conferences. His current research interests include neural computation, information retrieval, and software engineering.
