Abstract
Data visualization is aimed at obtaining a graphic representation of high dimensional information. A data projection over a lower dimensional space is pursued, looking for some structure on the projections. Among the several data projection based methods available, the Generative Topographic Mapping (GTM) has become an important probabilistic framework to model data. The application to document data requires a change in the original (Gaussian) model in order to consider binary or multinomial variables. There have been several modifications on GTM to consider this kind of data, but the resulting latent projections are all scattered on the visualization plane. A document visualization method is proposed in this paper, based on a generative probabilistic model consisting of a mixture of Zero-inflated Poisson distributions. The performance of the method is evaluated in terms of cluster forming for the latent projections with an index based on Fisher’s classifier, and the topology preservation capability is measured with the Sammon’s stress error. A comparison with the GTM implementation with Gaussian, multinomial and Poisson distributions and with a Latent Dirichlet model is presented, observing a greater performance for the proposed method. A graphic presentation of the projections is also provided, showing the advantage of the developed method in terms of visualization and class separation. A detailed analysis of some documents projected on the latent representation showed that most of the documents appearing away from the corresponding cluster could be identified as outliers.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Alvarez D, Hidalgo H (2006) ZIP and data document visualization. In: Proceedings of workshop on text mining in sixth SIAM international conference on data mining, SIAM, Bethesda
Bishop CM, Svénsen M, Williams CKI (1998) GTM: the generative topographic mapping. Neural Comput 10(1): 215–235
Blei DM, Ng AY, Jordan MI (2003) Latent Dirchlet allocation. J Mach Learn Res 3: 993–1022
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Series B Stat Methodol 39(1): 1–38
Dobson A (2002) An introduction to generalized linear models, 2nd edn. Chapman and Hall, London
Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley & Sons
Girolami M (2001) The topographic organization and visualization of binary data using multivariate-Bernoulli latent variable models. IEEE Trans Neural Netw 12(6): 1367–1374
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. Intell Inf Syst J 17(2–3): 107–145
Honkela T, Kaski S, Lagus K, Kohonen T (1996) Exploration of full-text databases with self-organizing maps. In: Proceedings of the IEEE International Conference on Neural Networks (ICNN96), IEEE Press, pp 56–61
Kabán A, Girolami M (2001) A combined latent class and trait model for the analysis and visualization of discrete Data. IEEE Trans Pattern Anal Mach Intell 23(8): 859–872
Kaski S, Honkela T, Lagus K, Kohonen T (1996) Creating an order in digital libraries with self-organizing maps. In: Proceedings of World Congress on Neural Networks (WCNN’96), Lawrence Erlbaum and INNS Press, pp 814–817
Kohonen T (1989) Self-organization and associative memory. Springer
Kohonen T, Kaski S, Lagus K, Honkela T (1996) Very large two-level SOM for the browsing of newsgroups. In: Proceedings of international conference on artificial neural networks (ICANN96), LNCS 1112, Springer, pp 269–274
Kruskal JB (1964) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1): 1–27
Lagus K, Honkela T, Kaski S, Kohonen T (1996) Self-organizing maps of document collections: a new approach to interactive exploration. In: Proceedings of the second international conference on knowledge discovery and data mining, AAAI Press, Menlo Park, pp 238–243
Lambert D (1992) Zero-inflated Poisson regression with an application to defects in manufacturing. Technometrics 34(1): 1–13
Li J, Zha H (2006) Two-way Poisson mixture models for simultaneous document classification and word clustering. Comput Stat Data Anal 50(1): 163–180
Mao J, Jain AK (1995) Artificial neural networks for feature extraction and multivariate data projection. IEEE Trans Neural Netw (6):2, 296–317
Miikkulainen R (1993) Subsymbolic natural language processing: an integrated model of scripts, lexicon and memory. MIT Press
Porter MF (1980) An algorithm for suffix stripping. Program 14(3): 130–137
Ritter H, Kohonen T (1989) Self organizing semantic maps. Biol Cybern 61: 241–254
Salton G, McGill MJ (1983). Introduction to modern information retrieval. McGraw-Hill
Sammon JW (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18: 404–409
Tinǒ P, Nabney IT (2002) Hierarchical GTM: constructing localized nonlinear projection manifolds in a principled way. IEEE Trans Pattern Anal Mach Intell 24(5): 639–656
Vellido A, Lisboa P (2006) Handling outliers in brain tumor MRS data analysis through robust topographic mapping. Comput Biol Med 10(36): 1049–1063
Wedel M, Desarbo WS, Bult JR, Ramaswamy V (1993) A latent class Poisson regression model for heterogeneous count data. J Appl Econom 8: 397–411
Yang J, Zhang BT (2001) Customer data mining and visualization by generative topographic mapping methods. In: Proceedings of the international workshop on visual data mining, LNAI 2168, Springer, Freiburg, pp 55–66
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: R. Bayardo.
Rights and permissions
About this article
Cite this article
Alvarez, D., Hidalgo, H. Document analysis and visualization with zero-inflated poisson. Data Min Knowl Disc 19, 1–23 (2009). https://doi.org/10.1007/s10618-009-0127-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-009-0127-4