Abstract
Sammon’s mapping is a powerful non-linear technique that allows us to visualize the relationships among high-dimensional objects. It has been applied to a broad range of practical problems, in particular to the visualization of the semantic relations among terms in textual databases. However, the word maps generated by the Sammon mapping suffer from low discriminant power, due to the well-known “curse of dimensionality” and to the unsupervised nature of the algorithm. Fortunately, textual databases frequently provide a manually created classification for a subset of documents, which may help to overcome this problem. In this paper we first introduce a modification of the Sammon mapping (SSammon) that enhances the local topology, reducing the sensitivity to the “curse of dimensionality”. Next, a semi-supervised version is proposed that takes advantage of the a priori categorization of a subset of documents to improve the discriminant power of the generated word maps. The new algorithm has been applied to the challenging problem of word map generation. The experimental results suggest that the new model significantly improves on well-known unsupervised alternatives.
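For orientation, the quantity that the classical, unsupervised Sammon mapping minimizes can be sketched as follows. This is a minimal illustration of the standard Sammon stress only, not the SSammon modification or the semi-supervised version proposed in the paper. The function names `pairwise_distances` and `sammon_stress`, the `eps` guard, and the use of Euclidean distances are assumptions made for the example.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(delta, Y, eps=1e-12):
    """Classical Sammon stress of a low-dimensional layout Y.

    delta : (n, n) symmetric matrix of input-space dissimilarities
    Y     : (n, k) coordinates of the same n objects in the map
    eps   : small constant (an assumption of this sketch) that avoids
            division by zero when two objects coincide in input space
    """
    d = pairwise_distances(Y)
    upper = np.triu_indices_from(delta, k=1)  # count each pair i < j once
    delta_ij = delta[upper]
    d_ij = d[upper]
    # Each squared error is divided by the input dissimilarity, so
    # mismatches between nearby objects are penalized more heavily.
    weighted_errors = (delta_ij - d_ij) ** 2 / (delta_ij + eps)
    return weighted_errors.sum() / delta_ij.sum()

# Usage: project 50-dimensional random points onto their first two coordinates
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
stress = sammon_stress(pairwise_distances(X), X[:, :2])
```

A stress of zero means that the map reproduces every input dissimilarity exactly. The 1/δ weighting is what makes the mapping favor small distances, which is the property the abstract refers to when it discusses local topology.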
Notes
Available from http://svmlight.joachims.org.
Acknowledgements
Financial support from Junta de Castilla y León grant PON05B06 is gratefully acknowledged.
Cite this article
Martín-Merino, M., Blanco, Á. A local semi-supervised Sammon algorithm for textual data visualization. J Intell Inf Syst 33, 23–40 (2009). https://doi.org/10.1007/s10844-008-0056-5
DOI: https://doi.org/10.1007/s10844-008-0056-5