Marginal median SOM for document organization and retrieval
Introduction
Due to their wide range of applications, artificial neural networks (ANNs) have been an active research area for the past three decades (Haykin, 1999). A large variety of learning algorithms (e.g. error-correction, memory-based, Hebbian, Boltzmann, supervised or unsupervised) has evolved and is being employed in ANNs. A further categorization divides the network architectures into three distinct classes: feedforward, feedback (recurrent), and competitive (Haykin, 1999).
The self-organizing maps (SOMs), or Kohonen's feature maps, are feedforward, competitive ANNs that employ a layer of input neurons and a single computational layer (Kohonen, 1997, Kohonen, 1990). The neurons of the computational layer are fully connected to the input layer and are arranged on an N-dimensional lattice. Low-dimensional grids, usually two-dimensional (2D) or 3D, have prominent visualization properties and are therefore employed for the visualization of high-dimensional data. In this paper, we use the SOM algorithm to cluster contextually similar documents into classes; accordingly, we focus on the 2D lattice so that the resulting classes can be visualized on the plane. For the 2D lattice, the computational layer can have either a hexagonal or an orthogonal topology. In hexagonal lattices, each neuron has six equidistant neighbors, whereas orthogonal lattices can be either four- or eight-connected. The competitive nature of the algorithm is expressed by the fact that, each time a new feature vector is presented to the ANN, only the neuron that is 'closest' to the input feature vector with respect to a given metric, together with its neighbors, is updated.
The SOMs are capable of forming a nonlinear transformation or mapping from an arbitrary-dimensional data manifold, the so-called input space, onto the low-dimensional lattice (Haykin, 1999, Kohonen, 1997). The algorithm takes into consideration the relations between the input feature vectors and computes a set of reference vectors in the output space that provide an efficient vector quantization of the input space. Moreover, the density of neurons, i.e. the number of neurons in a small volume of the input space, matches the probability density function (pdf) of the feature vectors. Generally, the approximation error is measured by the Mean Square Error (MSE). In doing so, the algorithm employs a linear Least Mean Square adaptation rule for updating the reference vector of each neuron. When the training procedure reaches equilibrium, it results in a partition of the domain of the vector-valued observations called a Voronoi tessellation (Kohonen, 1997, Ritter and Schulten, 1988). The convergence properties of SOMs are studied in Ritter and Schulten (1988) and Erwin et al. (1992).
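To make the update concrete, the on-line rule can be sketched as follows (a minimal illustration, not the authors' implementation; the learning rate `lr`, the Gaussian neighborhood kernel, and the width `sigma` are assumed choices):

```python
import numpy as np

def som_step(W, grid, x, lr=0.1, sigma=1.0):
    """One on-line SOM update: find the best-matching unit (BMU) for the
    input x, then move the BMU and its lattice neighbors toward x.
    W: (L, d) float array of reference vectors, updated in place.
    grid: (L, 2) lattice coordinates of the L neurons."""
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # competitive step
    d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)     # squared lattice distances
    h = np.exp(-d2 / (2.0 * sigma ** 2))             # neighborhood kernel
    W += lr * h[:, None] * (x - W)                   # LMS-style adaptation
    return bmu
```

The neighborhood kernel `h` is what distinguishes the SOM from plain vector quantization: neurons close to the winner on the lattice are dragged along, which produces the topology-preserving map.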
A complete and thorough investigation of the available variants of the SOM algorithm can be found in Kohonen (1997) and Kangas et al. (1990). One frequently used variant is the batch-map. The batch-map estimates the sample mean of the feature vectors that are assigned to each reference vector and subsequently smooths this mean to yield an updated reference vector. A trade-off is made between speed and clustering accuracy (Fort, Letremy, & Cottrell, 2002): the batch-map is faster than the on-line SOM algorithm, but it produces unbalanced classes of inferior quality compared with those produced by the on-line SOM. In the experiments reported in Section 5, the precision rate of the batch SOM algorithm is always lower than that of the on-line SOM for all recall rates.
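The batch-map step described above can be sketched roughly as follows (an assumed minimal implementation; the Gaussian smoothing kernel and its width `sigma` are illustrative choices standing in for the lattice neighborhood smoothing):

```python
import numpy as np

def batch_som_epoch(W, grid, X, sigma=1.0):
    """One batch-map epoch: assign every input vector to its nearest
    reference vector (forming the Voronoi sets), then replace each
    reference vector by a neighborhood-smoothed mean of the assigned
    vectors."""
    # Voronoi assignment: index of the winning neuron for each input.
    bmu = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
    # Lattice-distance smoothing kernel between all neuron pairs.
    d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    H = np.exp(-d2 / (2.0 * sigma ** 2))
    # Per-neuron sums and counts of the assigned vectors.
    counts = np.bincount(bmu, minlength=len(W)).astype(float)
    sums = np.zeros_like(W, dtype=float)
    np.add.at(sums, bmu, X)
    # Smoothed weighted mean; a neuron with an empty neighborhood keeps W.
    num = H @ sums
    den = (H @ counts)[:, None]
    return np.where(den > 1e-12, num / np.maximum(den, 1e-12), W)
```

Because each epoch processes the whole data set at once, the batch-map trades the sample-by-sample adaptation of the on-line rule for speed, which is the accuracy/speed trade-off noted above.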
The ability of the SOM algorithm to produce spatially organized representations of the input space can be utilized in document organization, where organization refers to the representation and storage of the available data. In this paper, we also exploit this algorithm for document retrieval, where retrieval refers to the exploration of the organized document repository through specific user-defined queries (Yates & Neto, 1999).
Prior to document indexing, the available textual data have to be transcribed into a numerical form, owing to the nature of the SOM algorithm. Among the three widely accepted encoding models used by the information retrieval (IR) community (Yates & Neto, 1999), namely the boolean, the probabilistic, and the vector space model, the last is the most appropriate for the SOM algorithm. In the vector space model, the documents and the queries used in the training and retrieval phases are represented by high-dimensional vectors. Each vector component corresponds to a different word type (i.e. a distinct word form) in the document collection (also called the corpus). The documents can then be easily clustered into contextually related collections by using any distance metric, such as the Euclidean, the Mahalanobis, or the city-block distance. Such a clustering is based on the assumption that the contextual correlation between the documents persists in their vectorial representation. The degree of similarity between a given query and the documents is measured with the same distance metric, and the documents marked as relevant to the query can be ranked in decreasing order of similarity according to this metric (Yates & Neto, 1999).
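A toy sketch of the vector space model and distance-based ranking described above (illustrative only; practical systems typically add stop-word removal and term weighting such as tf-idf, and the function names here are hypothetical):

```python
import numpy as np

def tf_vectors(docs, vocab):
    """Term-frequency vectors over a fixed vocabulary: one component
    per word type, as in the vector space model."""
    index = {w: i for i, w in enumerate(vocab)}
    V = np.zeros((len(docs), len(vocab)))
    for j, doc in enumerate(docs):
        for w in doc.split():
            if w in index:
                V[j, index[w]] += 1
    return V

def rank_by_similarity(query_vec, doc_vecs):
    """Rank documents by increasing Euclidean distance to the query,
    i.e. by decreasing similarity."""
    d = np.linalg.norm(doc_vecs - query_vec, axis=1)
    return np.argsort(d)
```

Swapping the Euclidean distance for the Mahalanobis or city-block distance only changes the line computing `d`; the ranking logic is identical.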
An architecture based on the SOM algorithm that is capable of clustering documents according to their semantic similarities is the so-called WEBSOM architecture (Kohonen, 1998, Kohonen et al., 1999, Kohonen et al., 2000). The WEBSOM consists of two distinct layers to which the SOM algorithm is applied. The first layer clusters the words found in the available training documents into semantically related collections. The second layer, which is activated after the completion of the first, clusters the available documents into classes that, with high probability, contain documents that are relevant with respect to their semantic content (i.e. context). For this reason, the WEBSOM architecture is regarded as a prominent candidate for document organization and retrieval.
In this paper, we test the performance of the SOM algorithm when the linear Least Mean Squares adaptation rule is replaced by the marginal median, for document organization and retrieval. The proposed algorithm has similarities with the batch-map, because both use the Voronoi sets, that is, the sets of feature vectors that have been assigned to each neuron, in order to update the reference vector of the neuron. The difference lies in replacing the averaging procedure employed in the batch-map by the marginal median operator in the proposed variant. However, the proposed algorithm remains an on-line algorithm.
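In this formulation, the marginal median of a Voronoi set reduces to a component-wise median, which can be sketched compactly (an assumed illustration, not the authors' code):

```python
import numpy as np

def marginal_median(voronoi_set):
    """Marginal median of a neuron's Voronoi set: each vector component
    is ordered independently and its median is taken, so a single
    outlying observation cannot drag the reference vector far away."""
    return np.median(np.asarray(voronoi_set, dtype=float), axis=0)
```

For example, for the Voronoi set {(0,0), (1,1), (2,2), (100,100), (-1,-1)} the marginal median is (1,1), whereas the sample mean used by the batch-map would be pulled to (20.4, 20.4) by the single outlier, which illustrates the robustness motivating the proposed variant.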
The outline of the paper is as follows: Section 2 provides a brief description of the basic SOM algorithm, its mathematical foundations, and a brief summary of the algorithm's inherent drawbacks. Section 3 describes the proposed variant with respect to the updating procedure of the reference vectors, which is based on marginal data ordering; it also contains a description of the two distinct implementations of the proposed algorithm. Section 4 is divided into three subsections: Section 4.1 covers the formation of the two corpora employed in our study and the preprocessing steps taken to remove any unwanted information from them; Section 4.2 describes the language model employed to encode the textual data into numerical vectors; and Section 4.3 is devoted to word and document clustering. In Section 5, we assess the experimental results using the MSE curves recorded during the training phase of the proposed algorithm and the basic SOM method, as well as the average recall–precision curves obtained by querying the document organization produced in the training phase of both systems.
Section snippets
Self-organizing maps
Let us denote by $\mathcal{X}=\{\mathbf{x}_j \in \mathbb{R}^{N_w} \mid j=1,2,\ldots,N\}$ the set of vector-valued observations, where $N_w$ corresponds to the dimensionality of the vectors that encode the $N$ available observations. Let also $\mathcal{W}(k)=\{\mathbf{w}_i(k) \in \mathbb{R}^{N_w} \mid i=1,2,\ldots,L\}$ denote the set of reference vectors of the neurons, where the parameter $k$ denotes discrete time and $L$ is the number of neurons on the lattice. Finally, let the neurons be located on a regular lattice that lies on the hyperplane which is determined by the two
Marginal median SOM
Order statistics have played an important role in statistical data analysis, and especially in the robust analysis of data contaminated with outlying observations (Pitas & Venetsanopoulos, 1990). The lack of any obvious and unambiguous extension of ordering to multivariate observations has led to several sub-ordering methods, such as marginal ordering, reduced (aggregate) ordering, partial ordering, and conditional (sequential) ordering. A discussion of these principles can be found in Barnett
Marginal median SOM application to document retrieval
The performance evaluation of the proposed variant against the basic SOM method is described here for document retrieval. The training has been performed on two corpora, namely the Hypergeo corpus (described subsequently) and the Reuters-21578 corpus (Lewis, 1997). The objective is to divide the corpora into contextually related document classes and then query these classes with sample query-documents in order to find the closest document class. The major advantage of the SOM approach is that it can
Experimental results
The performance of the MMSOM against the basic SOM method is measured using the MSE between the reference vectors and the document vectors assigned to each neuron in the training phase. Furthermore, the recall–precision performance, measured using query-documents from the test set during the recall phase, is used as an indirect measure of the quality of the document organization provided by both algorithms. Fig. 9 depicts the MSE curves during the formation of the WCM using the basic SOM
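For reference, precision and recall over a ranked retrieval list are computed as follows (a generic sketch of the standard IR measures, not the paper's evaluation code):

```python
def precision_recall(ranked, relevant, cutoff):
    """Precision and recall of the top-`cutoff` ranked documents:
    precision = retrieved-and-relevant / retrieved,
    recall    = retrieved-and-relevant / relevant."""
    rel = set(relevant)
    hits = sum(1 for doc in ranked[:cutoff] if doc in rel)
    return hits / cutoff, hits / len(rel)
```

Sweeping `cutoff` over the ranked list and averaging over all test queries yields the average recall–precision curves used in the comparison.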
Conclusions
The inherent drawbacks of the SOM algorithm with respect to the treatment of data outliers in the input space and the suboptimal estimation of the class means have given impetus to the development of a SOM variant that utilizes the marginal median and is capable of handling these drawbacks. Two implementations of the SOM variant that employ the multivariate median operator to update the reference vectors of the neurons have been discussed. A superior performance of the proposed variant
Acknowledgements
The authors would like to thank their colleagues, G. Albanidis and N. Bassiou, Aristotle University of Thessaloniki, Greece, for their contribution to the formation of the Hypergeo corpus. This work was supported by the European Union IST Project ‘HYPERGEO: Easy and friendly access to geographic information for mobile users’ (IST-1999-11641).
References (26)
- Barnett (1976). The ordering of multivariate data. Journal of the Royal Statistical Society A.
- Statistical language modeling using the CMU-Cambridge toolkit (1997).
- Erwin et al. (1992). Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics.
- Fort, Letremy, & Cottrell (2002). Advantages and drawbacks of the Batch Kohonen algorithm.
- Information retrieval: data structures and algorithms (1992).
- Introduction to statistical pattern recognition (1990).
- Haykin (1999). Neural networks: a comprehensive foundation.
- A fast two-dimensional median filtering algorithm (1979). IEEE Transactions on Acoustics, Speech and Signal Processing.
- Robust statistics (1981).
- Kangas et al. (1990, March). Variants of self-organizing maps. IEEE Transactions on Neural Networks.
- Dimensionality reduction by random mapping: Fast similarity computation for clustering.
- Kohonen (1990). The self-organizing map. Proceedings of the IEEE.
- Kohonen (1997). Self-organizing maps.
1. Present address: Digital Media Laboratory (DML), Department of Applied Physics and Electronics, Umeå University, Umeå SE-90187, Sweden.