Document analysis and visualization with zero-inflated poisson

Data Mining and Knowledge Discovery

Abstract

Data visualization aims to obtain a graphical representation of high-dimensional information. The data are projected onto a lower-dimensional space in search of structure in the projections. Among the several projection-based methods available, the Generative Topographic Mapping (GTM) has become an important probabilistic framework for modeling data. Applying it to document data requires changing the original (Gaussian) model so that it can handle binary or multinomial variables. Several modifications of GTM have been proposed for this kind of data, but the resulting latent projections are all scattered across the visualization plane. This paper proposes a document visualization method based on a generative probabilistic model consisting of a mixture of Zero-inflated Poisson distributions. The performance of the method is evaluated in terms of cluster formation in the latent projections, using an index based on Fisher’s classifier, and its topology preservation capability is measured with Sammon’s stress error. A comparison with GTM implementations using Gaussian, multinomial and Poisson distributions, and with a Latent Dirichlet model, shows better performance for the proposed method. A graphical presentation of the projections is also provided, showing the advantage of the developed method in terms of visualization and class separation. A detailed analysis of some documents projected onto the latent representation showed that most of the documents appearing away from their corresponding cluster could be identified as outliers.
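
The abstract names two standard quantities without giving formulas: the Zero-inflated Poisson distribution and Sammon's stress. The sketch below writes out their textbook definitions in plain Python, as an illustration only. It is not the authors' implementation, and the function names `zip_pmf` and `sammon_stress` and the parameter names `lam`, `pi`, `high` and `low` are assumptions made for this example.

```python
import math


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson pmf (textbook form, names are illustrative).

    With probability pi the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with rate lam.
    """
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros come from both the inflation component and the Poisson part.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def sammon_stress(high, low):
    """Sammon's stress between data-space points and their projections.

    high and low are equal-length sequences of coordinate tuples; the
    i-th entry of low is the projection of the i-th entry of high.
    """
    total = 0.0  # sum of data-space distances (normalizer)
    err = 0.0    # distance-weighted squared mismatch
    for i in range(len(high)):
        for j in range(i + 1, len(high)):
            d_high = math.dist(high[i], high[j])
            d_low = math.dist(low[i], low[j])
            if d_high == 0.0:
                continue  # coincident points carry no weight
            total += d_high
            err += (d_high - d_low) ** 2 / d_high
    return err / total if total > 0.0 else 0.0
```

A stress of zero means every pairwise distance is preserved exactly by the projection; larger values indicate more distortion. For sparse term counts, a larger `pi` puts more mass on zero than a plain Poisson with the same `lam` would.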

Author information

Corresponding author

Correspondence to Hugo Hidalgo.

Additional information

Responsible editor: R. Bayardo.

About this article

Cite this article

Alvarez, D., Hidalgo, H. Document analysis and visualization with zero-inflated poisson. Data Min Knowl Disc 19, 1–23 (2009). https://doi.org/10.1007/s10618-009-0127-4

  • DOI: https://doi.org/10.1007/s10618-009-0127-4
