Marginal median SOM for document organization and retrieval
Introduction
Due to their wide range of applications, artificial neural networks (ANNs) have been an active research area for the past three decades (Haykin, 1999). A large variety of learning algorithms (e.g. error-correction, memory-based, Hebbian, Boltzmann, supervised or unsupervised) has evolved and is being employed in ANNs. A further categorization divides the network architectures into three distinct classes: feedforward, feedback (recurrent), and competitive (Haykin, 1999).
The self-organizing maps (SOMs), or Kohonen's feature maps, are feedforward, competitive ANNs that employ a layer of input neurons and a single computational layer (Kohonen, 1997, Kohonen, 1990). The neurons of the computational layer are fully connected to the input layer and are arranged on an N-dimensional lattice. Low-dimensional grids, usually two-dimensional (2D) or 3D, have prominent visualization properties and are therefore employed for the visualization of high-dimensional data. In this paper, we use the SOM algorithm to cluster contextually similar documents into classes; accordingly, we focus on the 2D lattice so that the resulting classes can be visualized on the plane. For the 2D lattice, the computational layer can have either a hexagonal or an orthogonal topology. In hexagonal lattices, each neuron has six equidistant neighbors, whereas orthogonal lattices can be either four- or eight-connected. The competitive nature of the algorithm is expressed by the fact that, each time a new feature vector is presented to the ANN, only the neuron that is 'closest' to the input feature vector with respect to a given metric, together with its neighbors, is updated.
The SOMs are capable of forming a nonlinear transformation or mapping from an arbitrary-dimensional data manifold, the so-called input space, onto the low-dimensional lattice (Haykin, 1999, Kohonen, 1997). The algorithm takes into consideration the relations between the input feature vectors and computes a set of reference vectors in the output space that provide an efficient vector quantization of the input space. Moreover, the density of neurons, i.e. the number of neurons in a small volume of the input space, matches the probability density function (pdf) of the feature vectors. Generally, the approximation error is measured by the Mean Square Error (MSE). In doing so, the algorithm employs a linear Least Mean Square adaptation rule for updating the reference vector of each neuron. When the training procedure reaches equilibrium, it results in a partition of the domain of the vector-valued observations called a Voronoi tessellation (Kohonen, 1997, Ritter and Schulten, 1988). The convergence properties of SOMs are studied in Ritter and Schulten (1988) and Erwin et al. (1992).
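To make the update concrete, the on-line rule can be sketched as follows (a minimal illustration, not the authors' implementation; the learning rate `lr`, the Gaussian neighborhood kernel, and the width `sigma` are assumed choices):

```python
import numpy as np

def som_step(W, grid, x, lr=0.1, sigma=1.0):
    """One on-line SOM update: find the best-matching unit (BMU) for the
    input x, then move the BMU and its lattice neighbors toward x.
    W: (L, d) float array of reference vectors, updated in place.
    grid: (L, 2) lattice coordinates of the L neurons."""
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))   # competitive step
    d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)     # squared lattice distances
    h = np.exp(-d2 / (2.0 * sigma ** 2))             # neighborhood kernel
    W += lr * h[:, None] * (x - W)                   # LMS-style adaptation
    return bmu
```

The neighborhood kernel `h` is what distinguishes the SOM from plain vector quantization: neurons close to the winner on the lattice are dragged along, which produces the topology-preserving map.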
A complete and thorough investigation of the available variants of the SOM algorithm can be found in Kohonen (1997) and Kangas et al. (1990). One frequently used variant is the batch-map. The batch-map estimates the sample mean of the feature vectors that are assigned to each reference vector and subsequently smooths this mean to yield an updated reference vector. A trade-off is made between speed and clustering accuracy (Fort, Letremy, & Cottrell, 2002): the batch-map is faster than the on-line SOM algorithm, but it produces unbalanced classes of inferior quality compared with those produced by the on-line SOM. In the experiments reported in Section 5, the precision rate of the batch SOM algorithm is always lower than that of the on-line SOM for all recall rates.
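The batch-map step described above can be sketched roughly as follows (an assumed minimal implementation; the Gaussian smoothing kernel and its width `sigma` are illustrative choices standing in for the lattice neighborhood smoothing):

```python
import numpy as np

def batch_som_epoch(W, grid, X, sigma=1.0):
    """One batch-map epoch: assign every input vector to its nearest
    reference vector (forming the Voronoi sets), then replace each
    reference vector by a neighborhood-smoothed mean of the assigned
    vectors."""
    # Voronoi assignment: index of the winning neuron for each input.
    bmu = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
    # Lattice-distance smoothing kernel between all neuron pairs.
    d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    H = np.exp(-d2 / (2.0 * sigma ** 2))
    # Per-neuron sums and counts of the assigned vectors.
    counts = np.bincount(bmu, minlength=len(W)).astype(float)
    sums = np.zeros_like(W, dtype=float)
    np.add.at(sums, bmu, X)
    # Smoothed weighted mean; a neuron with an empty neighborhood keeps W.
    num = H @ sums
    den = (H @ counts)[:, None]
    return np.where(den > 1e-12, num / np.maximum(den, 1e-12), W)
```

Because each epoch processes the whole data set at once, the batch-map trades the sample-by-sample adaptation of the on-line rule for speed, which is the accuracy/speed trade-off noted above.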
The ability of the SOM algorithm to produce spatially organized representations of the input space can be utilized in document organization, where organization refers to the representation and storage of the available data. In this paper, we also exploit this algorithm for document retrieval, where retrieval refers to the exploration of the organized document repository through specific user-defined queries (Yates & Neto, 1999).
Prior to document indexing, the available textual data have to be transcribed into a numerical form, owing to the nature of the SOM algorithm. Among the three widely accepted encoding models used by the information retrieval (IR) community (Yates & Neto, 1999), namely the boolean, the probabilistic, and the vector space model, the last is the most appropriate for the SOM algorithm. In the vector space model, the documents and the queries used in the training and retrieval phases are represented by high-dimensional vectors. Each vector component corresponds to a different word type (i.e. a distinct word form) in the document collection (also called the corpus). The documents can then be easily clustered into contextually related collections by using any distance metric, such as the Euclidean, the Mahalanobis, or the city-block distance. Such a clustering is based on the assumption that the contextual correlation between the documents persists in their vectorial representation. The degree of similarity between a given query and the documents is measured with the same distance metric, and the documents marked as relevant to the query can be ranked in decreasing order of similarity according to this metric (Yates & Neto, 1999).
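A toy sketch of the vector space model and distance-based ranking described above (illustrative only; practical systems typically add stop-word removal and term weighting such as tf-idf, and the function names here are hypothetical):

```python
import numpy as np

def tf_vectors(docs, vocab):
    """Term-frequency vectors over a fixed vocabulary: one component
    per word type, as in the vector space model."""
    index = {w: i for i, w in enumerate(vocab)}
    V = np.zeros((len(docs), len(vocab)))
    for j, doc in enumerate(docs):
        for w in doc.split():
            if w in index:
                V[j, index[w]] += 1
    return V

def rank_by_similarity(query_vec, doc_vecs):
    """Rank documents by increasing Euclidean distance to the query,
    i.e. by decreasing similarity."""
    d = np.linalg.norm(doc_vecs - query_vec, axis=1)
    return np.argsort(d)
```

Swapping the Euclidean distance for the Mahalanobis or city-block distance only changes the line computing `d`; the ranking logic is identical.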
An architecture based on the SOM algorithm that is capable of clustering documents according to their semantic similarities is the so-called WEBSOM architecture (Kohonen, 1998, Kohonen et al., 1999, Kohonen et al., 2000). The WEBSOM consists of two distinct layers to which the SOM algorithm is applied. The first layer clusters the words found in the available training documents into semantically related collections. The second layer, which is activated after the completion of the first, clusters the available documents into classes that, with high probability, contain documents that are relevant with respect to their semantic content (i.e. context). For this reason, the WEBSOM architecture is regarded as a prominent candidate for document organization and retrieval.
In this paper, we test the performance of the SOM algorithm when the linear Least Mean Squares adaptation rule is replaced by the marginal median, for document organization and retrieval. The proposed algorithm has similarities with the batch-map, because both use the Voronoi sets, that is, the sets of feature vectors that have been assigned to each neuron, in order to update the reference vector of the neuron. The difference lies in replacing the averaging procedure employed in the batch-map by the marginal median operator in the proposed variant. However, the proposed algorithm remains an on-line algorithm.
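In this formulation, the marginal median of a Voronoi set reduces to a component-wise median, which can be sketched compactly (an assumed illustration, not the authors' code):

```python
import numpy as np

def marginal_median(voronoi_set):
    """Marginal median of a neuron's Voronoi set: each vector component
    is ordered independently and its median is taken, so a single
    outlying observation cannot drag the reference vector far away."""
    return np.median(np.asarray(voronoi_set, dtype=float), axis=0)
```

For example, for the Voronoi set {(0,0), (1,1), (2,2), (100,100), (-1,-1)} the marginal median is (1,1), whereas the sample mean used by the batch-map would be pulled to (20.4, 20.4) by the single outlier, which illustrates the robustness motivating the proposed variant.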
The outline of the paper is as follows: Section 2 provides a brief description of the basic SOM algorithm, its mathematical foundations, and a brief summary of the algorithm's inherent drawbacks. Section 3 describes the proposed variant with respect to the updating procedure of the reference vectors, which is based on marginal data ordering; it also contains a description of the two distinct implementations of the proposed algorithm. Section 4 is divided into three subsections: Section 4.1 covers the formation of the two corpora employed in our study and the preprocessing steps taken to remove any unwanted information from them; Section 4.2 describes the language model employed to encode the textual data into numerical vectors; and Section 4.3 is devoted to word and document clustering. In Section 5, we assess the experimental results using the MSE curves recorded during the training phase of the proposed algorithm and the basic SOM method, as well as the average recall–precision curves obtained by querying the document organization produced in the training phase of both systems.
Section snippets
Self-organizing maps
Let us denote by $\mathcal{X}=\{\mathbf{x}_j \in \mathbb{R}^{N_w} \mid j=1,2,\ldots,N\}$ the set of vector-valued observations, where $N_w$ corresponds to the dimensionality of the vectors that encode the $N$ available observations. Let also $\mathcal{W}(k)=\{\mathbf{w}_i(k) \in \mathbb{R}^{N_w} \mid i=1,2,\ldots,L\}$ denote the set of reference vectors of the neurons, where the parameter $k$ denotes discrete time and $L$ is the number of neurons on the lattice. Finally, let the neurons be located on a regular lattice that lies on the hyperplane which is determined by the two
Marginal median SOM
Order statistics have played an important role in statistical data analysis, and especially in the robust analysis of data contaminated with outlying observations (Pitas & Venetsanopoulos, 1990). The lack of any obvious and unambiguous extension of ordering to multivariate observations has led to several sub-ordering methods, such as marginal ordering, reduced (aggregate) ordering, partial ordering, and conditional (sequential) ordering. A discussion of these principles can be found in Barnett
Marginal median SOM application to document retrieval
The performance evaluation of the proposed variant against the basic SOM method is described here for document retrieval. The training has been performed on two corpora, namely the Hypergeo corpus (described subsequently) and the Reuters-21578 corpus (Lewis, 1997). The objective is to divide the corpora into contextually related document classes and then query these classes with sample query-documents in order to find the closest document class. The major advantage of the SOM approach is that it can
Experimental results
The performance of the MMSOM against the basic SOM method is measured using the MSE between the reference vectors and the document vectors assigned to each neuron in the training phase. Furthermore, the recall–precision performance, measured using query-documents from the test set during the recall phase, is used as an indirect measure of the quality of the document organization provided by both algorithms. Fig. 9 depicts the MSE curves during the formation of the WCM using the basic SOM
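For reference, precision and recall over a ranked retrieval list are computed as follows (a generic sketch of the standard IR measures, not the paper's evaluation code):

```python
def precision_recall(ranked, relevant, cutoff):
    """Precision and recall of the top-`cutoff` ranked documents:
    precision = retrieved-and-relevant / retrieved,
    recall    = retrieved-and-relevant / relevant."""
    rel = set(relevant)
    hits = sum(1 for doc in ranked[:cutoff] if doc in rel)
    return hits / cutoff, hits / len(rel)
```

Sweeping `cutoff` over the ranked list and averaging over all test queries yields the average recall–precision curves used in the comparison.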
Conclusions
The inherent drawbacks of the SOM algorithm with respect to the treatment of data outliers in the input space and the suboptimal estimation of the class means have given impetus to the development of a SOM variant that utilizes the marginal median and is capable of handling these drawbacks. Two implementations of the SOM variant that employ the multivariate median operator to update the reference vectors of the neurons have been discussed. A superior performance of the proposed variant
Acknowledgements
The authors would like to thank their colleagues, G. Albanidis and N. Bassiou, Aristotle University of Thessaloniki, Greece, for their contribution to the formation of the Hypergeo corpus. This work was supported by the European Union IST Project ‘HYPERGEO: Easy and friendly access to geographic information for mobile users’ (IST-1999-11641).
References (26)
- Barnett (1976). The ordering of multivariate data. Journal of the Royal Statistical Society A.
- Statistical language modeling using the CMU-Cambridge toolkit (1997).
- Erwin et al. (1992). Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics.
- Fort, Letremy, & Cottrell (2002). Advantages and drawbacks of the Batch Kohonen algorithm.
- Information retrieval: data structures and algorithms (1992).
- Introduction to statistical pattern recognition (1990).
- Haykin (1999). Neural networks: a comprehensive foundation.
- A fast two-dimensional median filtering algorithm (1979). IEEE Transactions on Acoustics, Speech and Signal Processing.
- Robust statistics (1981).
- Kangas et al. (1990, March). Variants of self-organizing maps. IEEE Transactions on Neural Networks.
- Dimensionality reduction by random mapping: Fast similarity computation for clustering.
- Kohonen (1990). The self-organizing map. Proceedings of the IEEE.
- Kohonen (1997). Self-organizing maps.
1. Present address: Digital Media Laboratory (DML), Department of Applied Physics and Electronics, Umeå University, Umeå SE-90187, Sweden.