Abstract
Clustering aims to discover natural groupings and thus present an overview of the classes (topics) in a collection of documents. In artificial intelligence, this is known as unsupervised machine learning. Extracting internal structure from document collections in the absence of pre-classified training data is a challenging text-mining task because of the high dimensionality of the input data, usually word-frequency vectors derived from the bag-of-words (BOW) model of document representation. Self-Organizing Maps (SOM) represent high-dimensional data as topology-preserving two-dimensional projections, which can be exploited both to create a natural visualization of the data and to reduce its dimensionality. Emergence, the generation of complex systems and patterns through the cooperation of many elementary interactions, provides a way of detecting higher-level structures, or clusters of clusters, within a document corpus. This study investigates the natural visualization of clusters (rather than classification/categorization) using Emergent Self-Organizing Maps by introducing bigrams. Experiments were conducted on 925 documents with a limited vocabulary of approximately 2000 unigrams and 1000 bigrams to analyze the visualization of emergent higher-level structures and document relatedness at the lower level, and to show the presence of micro-clusters.
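As a rough illustration of the representation described in the abstract, the following sketch builds unigram-plus-bigram frequency vectors for a few toy documents and projects them onto a small two-dimensional SOM grid. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the 4x4 grid, the learning-rate and radius schedules, and the toy documents are all hypothetical, and stopword removal and stemming are omitted. The grid is also deliberately tiny so that the example runs quickly.

```python
import math
import random
from collections import Counter

def tokens(text):
    """Lowercase and split on whitespace (no stemming or stopword removal)."""
    return text.lower().split()

def features(text):
    """Count unigrams and adjacent-word bigrams in one document."""
    words = tokens(text)
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return Counter(words) + Counter(bigrams)

def vectorize(docs):
    """Map each document onto a shared, sorted unigram+bigram vocabulary."""
    counts = [features(d) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[float(c[t]) for t in vocab] for c in counts]

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(grid, vec):
    """Return the (row, col) of the neuron whose weights are closest to vec."""
    return min(grid, key=lambda pos: dist2(grid[pos], vec))

def train_som(vectors, rows=4, cols=4, epochs=50, seed=0):
    """Train a small rectangular SOM; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = 0.5 * frac + 0.01                     # decaying learning rate
        radius = max(rows, cols) / 2 * frac + 0.5  # shrinking neighbourhood
        for vec in vectors:
            br, bc = best_matching_unit(grid, vec)
            for (r, c), w in grid.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (vec[i] - w[i])
    return grid

# Toy corpus (hypothetical): two related documents and one unrelated one.
docs = ["oil prices rise", "oil prices fall", "wheat crop harvest"]
vocab, vectors = vectorize(docs)
grid = train_som(vectors)
positions = [best_matching_unit(grid, v) for v in vectors]
```

Each entry of `positions` is the grid cell onto which a document is projected. In a visualization of the kind the abstract describes, related documents would be expected to land on nearby cells, although a corpus this small is not enough to demonstrate that.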
Copyright information
© 2012 Springer India Pvt. Ltd.
About this paper
Cite this paper
Singh, P.K., Machavolu, M., Bharti, K., Suda, R. (2012). Analysis of Text Cluster Visualization in Emergent Self Organizing Maps Using Unigrams and Its Variations after Introducing Bigrams. In: Deep, K., Nagar, A., Pant, M., Bansal, J. (eds) Proceedings of the International Conference on Soft Computing for Problem Solving (SocProS 2011) December 20-22, 2011. Advances in Intelligent and Soft Computing, vol 131. Springer, New Delhi. https://doi.org/10.1007/978-81-322-0491-6_89
DOI: https://doi.org/10.1007/978-81-322-0491-6_89
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-0490-9
Online ISBN: 978-81-322-0491-6
eBook Packages: Engineering, Engineering (R0)