Text classification with self-organizing maps: Some lessons learned

doi:10.1016/S0925-2312(98)00032-0

Neurocomputing

Volume 21, Issues 1–3, 6 November 1998, Pages 61-77

https://doi.org/10.1016/S0925-2312(98)00032-0

Abstract

The self-organizing map has already found appreciation for document classification in the information retrieval community. The map display is a highly effective and intuitive metaphor for orientation in the information space established by a document collection. In this paper we discuss ways for using self-organizing maps for document classification. Furthermore, we argue in favor of paying more attention to the fact that document collections lend themselves naturally to a hierarchical structure defined by the subject matter of the documents. We take advantage of this fact by using a hierarchically organized neural network, built up from a number of independent self-organizing maps in order to enable the true establishment of a document taxonomy. As a highly convenient side effect of using such an architecture, the time needed for training is reduced substantially and the user is provided with an even more intuitive metaphor for visualization. Since the single layers of self-organizing maps represent different aspects of the document collection at different levels of detail, the neural network shows the document collection in a form comparable to an atlas where the user may easily select the most appropriate degree of granularity depending on the actual focus of interest during the exploration of the document collection.

Introduction

During recent years we have witnessed the appearance of an ever-increasing flood of miscellaneous written information available in computer-accessible form, culminating in the advent of massive digital libraries. Powerful methods for organizing, exploring, and searching collections of text documents are thus needed to deal with that mass of information. Classical methods developed by the information retrieval community for searching documents are based on keywords assigned either manually or automatically by indexing the full text of the various documents. These methods may be enhanced with proximity search functionality and keyword combination according to Boolean algebra. Other widely used approaches rely instead on document similarity measures based on a vector-space representation of the various texts. Still missing, however, are tools providing assistance for explorative search in document collections. Explorative search may be characterized as the struggle to uncover useful information when the user is unaware of appropriate keywords which could guide the search process towards relevant information. The reason for the existence of such a situation is twofold. Firstly, the user often has only limited insight into what is actually contained in the text collection and thus has only vague expectations of what might be found. Secondly, a usually convenient characteristic of natural language, namely that the same fact of reality may be described in a number of different ways, turns out to be a hindrance in locating relevant information because the same piece of information may be represented by different sets of keywords. This is often referred to as the vocabulary problem in the information retrieval literature [4].

Exploration of document archives may be supported by organizing the documents into taxonomies or hierarchies according to their subject matter. As an aside, such an organization has been in use by librarians for centuries. A number of approaches are applicable for arriving at such a classification. Among the oldest and most widely used is statistics, especially cluster analysis. The use of hierarchical cluster analysis for document classification has a long tradition in information retrieval research, and its specific strengths and weaknesses are well explored [32, 37].

The renaissance of widespread interest in artificial neural network models, which began more than a decade ago, is at least partly due to the availability of massive computer power at reasonable prices and the invention of a broad spectrum of highly effective learning rules. As a consequence, artificial neural networks are widely used to uncover structure in a large variety of actual input data. High-dimensional input data are especially challenging in this context. An almost perfect example is text document classification, where the documents are, by nature, described in a high-dimensional feature space spanned by the keywords extracted from the documents.

In this paper we describe an approach to text classification relying on the self-organizing map performing the classification task. In order to use the self-organizing map for text classification, the various documents are represented as a histogram of word occurrences. The material contained in the paper is organized as follows. In Section 2 we provide a brief description of the architecture and learning rule of self-organizing maps. Section 3 is dedicated to an exposition of some issues we believe to be crucial for document classification. In Section 4 the sample document collection used for our experiments is introduced. The results from the classification process are presented in Section 5. In particular, we give examples of document classification by using the basic model of self-organizing maps as well as a hierarchy of self-organizing maps in order to provide the user with a taxonomy of documents. In Section 6 we give a brief review of the ever-increasing number of reports on document classification by using self-organizing maps and related models adhering to the unsupervised learning paradigm. Finally, we present some conclusions in Section 7.

Self-organizing maps

The self-organizing map [12, 13] is a general unsupervised tool for ordering high-dimensional data in such a way that similar input patterns are grouped spatially close to one another. It consists of a layer of input units that receive input patterns and propagate them as they are to a set of output units. These output units are arranged according to some topology where the most common choice is a two-dimensional grid. Each of the output units i is assigned a weight vector $m_{i}$.
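The architecture described above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper. The function names, the Euclidean distance used for winner selection, the Gaussian neighbourhood and the values of alpha and radius are all assumptions chosen for illustration, following the commonly used form of the self-organizing map.

```python
import math
import random

def make_map(rows, cols, dim, seed=0):
    """Create a rows x cols grid of output units, each holding a weight vector m_i."""
    rng = random.Random(seed)
    return {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}

def winner(som, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(som, key=lambda pos: math.dist(som[pos], x))

def train_step(som, x, alpha=0.5, radius=1.0):
    """Move the winner and its grid neighbours towards x; return the winner position."""
    wr, wc = winner(som, x)
    for (r, c), m in som.items():
        grid_dist = math.hypot(r - wr, c - wc)
        # Gaussian neighbourhood on the grid: an assumption, not taken from the paper.
        h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        for k in range(len(m)):
            m[k] += alpha * h * (x[k] - m[k])
    return (wr, wc)

som = make_map(3, 3, 2)          # 3x3 grid, two-dimensional inputs (toy sizes)
print(train_step(som, [0.9, 0.1]))
```

In practice alpha and radius would normally shrink over the course of training; they are held fixed here to keep the sketch short.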

During each learning

Issues in document classification

Generally, the task of text classification aims at uncovering the semantic similarities between various documents. In a first step, the documents have to be mapped onto some representation language in order to enable further analyses. This process is termed indexing in the information retrieval literature. A number of different strategies have been suggested over the years of information retrieval research. Still one of the most common representation techniques is single term full text indexing
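As a rough illustration of single term full text indexing, the following Python sketch turns each document into a histogram of word occurrences over a shared vocabulary. The tokenizer (lower-cased alphabetic runs), the function names and the two sample texts are assumptions made for this example. No stop-word removal, stemming or term weighting is applied, and the paper's own preprocessing may differ.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-case alphabetic terms (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def index_documents(docs):
    """Map {name: text} to a sorted vocabulary and one term-count vector per document."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    vectors = {name: [c[term] for term in vocabulary]
               for name, c in counts.items()}
    return vocabulary, vectors

# Hypothetical miniature "manual pages", invented for illustration only.
docs = {
    "Set": "a set stores objects",
    "Bag": "a bag stores objects and counts objects",
}
vocabulary, vectors = index_documents(docs)
print(vocabulary)
print(vectors["Bag"])
```

Each vector has one component per vocabulary term, which is why the resulting feature space becomes high-dimensional for realistic collections.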

A sample document collection

In order to present the effect of document classification in a uniform fashion, we will use the manual pages of a C++ class library as a reference document collection. In particular, we use the NIHCL [5], a class library developed at the National Institutes of Health, which consists of classes implementing basic data types such as Float or Date, classes implementing data structures as for instance Dictionary or Set, and a number of classes implementing file I/O functionality like OIOin or OIOout

A map of documents

A typical result from the application of self-organizing maps to this kind of data is presented in Fig. 2. In this case we used a 10×10 grid of units to represent the document collection. The graphical representation of the classification is straightforward in that each unit is represented either by means of a class name, in case the respective unit is the winner for that specific input pattern, or by means of a dot, in case the unit has not won the competition for any input pattern.
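The display convention just described (a class name on each winning unit, a dot elsewhere) can be sketched in a few lines of Python. The function name, the grid positions and the use of "/" to join several names that share a unit are assumptions for illustration. The class names are taken from the collection description, but their placement here is invented and does not reproduce Fig. 2.

```python
def render_map(rows, cols, winners):
    """Render a grid as text lines: class names on winning units, dots elsewhere.

    winners maps a document name to the (row, col) of its winning unit.
    """
    labels = {}
    for name, pos in winners.items():
        labels.setdefault(pos, []).append(name)
    # Column width: the longest label, or 1 if no unit has won.
    width = max([len("/".join(names)) for names in labels.values()] + [1])
    lines = []
    for r in range(rows):
        cells = ["/".join(labels[(r, c)]) if (r, c) in labels else "."
                 for c in range(cols)]
        lines.append(" ".join(cell.ljust(width) for cell in cells).rstrip())
    return lines

# Hypothetical winner positions on a small 3x4 grid.
winners = {"Set": (0, 0), "Dictionary": (0, 1), "Date": (2, 3)}
for line in render_map(3, 4, winners):
    print(line)
```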

For the

Related work

Text classification with self-organizing maps has already produced an impressive record of publications. The work of Ref. [17] perhaps marks the first attempt to use the self-organizing map in an information retrieval setting. In this work the authors report the results from classifying a number of technical documents based on keywords extracted from the various titles. In total, the document representation is made up from 25 manually selected index terms and is thus not really realistic. The

Conclusions

In this paper we provided an account of the capabilities of self-organizing maps in performing a highly important task of information retrieval, namely text classification. As an experimental document collection we used the manual pages describing the various C++ classes contained in a real-world class library. This environment was chosen because the inherent structure of the document collection is known and thus available for evaluation of the classification results. We used a document

Acknowledgements

The author is grateful to the anonymous referees who provided lots of insightful comments on an earlier version of this paper.

Dieter Merkl received his diploma and doctoral degree in social and economic sciences from the University of Vienna, Austria, in 1989 and 1995, respectively. From 1990 to 1994 he held a research position at the University of Vienna. Since 1995 he has been a member of the academic faculty at Vienna University of Technology. During 1997 he was a visiting research fellow with the Department of Computer Science, Royal Melbourne Institute of Technology. He has published over 50 articles in refereed journals and

References (38)

[1] J. Blackmore, R. Miikkulainen, Incremental grid growing: encoding high-dimensional structure into a two-dimensional...
[2] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing by latent semantic analysis, J. Amer. Soc....
[3] B. Fritzke, Growing cell structures: a self-organizing network for unsupervised and supervised learning, Neural...
[4] G.W. Furnas, T.K. Landauer, L.M. Gomez, S.T. Dumais, The vocabulary problem in human-system communications, Comm. ACM...
[5] K.E. Gorlen, NIH Class Library Reference Manual, National Institutes of Health, Bethesda, MD,...
[6] K.E. Gorlen, S. Orlow, P. Plexico, Abstraction and Object-Oriented Programming in C++, Wiley, New York,...
[7] T. Honkela, Self-organizing maps of words for natural language processing applications, Proc. Int. ICSC Symp. on Soft...
[8] T. Honkela, S. Kaski, K. Lagus, T. Kohonen, Newsgroup exploration with WEBSOM method and browsing interface, Helsinki...
[9] T. Honkela, S. Kaski, K. Lagus, T. Kohonen, WEBSOM – self-organizing maps of document collections, Proc. Workshop on...
[10] T. Honkela, V. Pulkki, T. Kohonen, Contextual relations of words in Grimm tales analyzed by self-organizing maps, Proc....
[11] M. Köhle, D. Merkl, Visualizing similarities in high dimensional input spaces with a growing and splitting neural...
[12] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybernet. 43...
[13] T. Kohonen, Self-organizing Maps (1995)
[14] T. Kohonen, S. Kaski, Automatic coloring of data according to its cluster structure, in: E. Alhoniemi, J. Iivarinen, L....
[15] T. Kohonen, S. Kaski, K. Lagus, T. Honkela, Very large two-level SOM for the browsing of newsgroups, Proc. Int. Conf....
[16] K. Lagus, T. Honkela, S. Kaski, T. Kohonen, Self-organizing maps of document collections: a new approach to interactive...
[17] X. Lin, D. Soergel, G. Marchionini, A self-organizing semantic map for information retrieval, Proc. Int. ACM SIGIR...
[18] D. Merkl, A connectionist view on document classification, Proc. Australasian Database Conf., Adelaide, Australia,...
[19] D. Merkl, Content-based document classification with highly compressed input data, Proc. Int. Conf. on Artificial...
