Document analysis and visualization with zero-inflated poisson

Alvarez, Dora; Hidalgo, Hugo

doi:10.1007/s10618-009-0127-4

Published: 14 February 2009

Volume 19, pages 1–23, (2009)

Data Mining and Knowledge Discovery

Abstract

Data visualization aims to obtain a graphical representation of high-dimensional information. The data are projected onto a lower-dimensional space in search of structure in the projections. Among the available projection-based methods, the Generative Topographic Mapping (GTM) has become an important probabilistic framework for modeling data. Applying it to document data requires changing the original (Gaussian) model to handle binary or multinomial variables. Several modifications of GTM handle this kind of data, but the resulting latent projections are scattered across the visualization plane. This paper proposes a document visualization method based on a generative probabilistic model consisting of a mixture of zero-inflated Poisson distributions. The method’s performance is evaluated in terms of cluster formation in the latent projections, using an index based on Fisher’s classifier, and its topology preservation capability is measured with Sammon’s stress error. A comparison with GTM implementations using Gaussian, multinomial and Poisson distributions, and with a Latent Dirichlet model, shows better performance for the proposed method. A graphical presentation of the projections is also provided, showing the advantage of the developed method in terms of visualization and class separation. A detailed analysis of some documents projected onto the latent representation showed that most documents appearing away from their corresponding cluster could be identified as outliers.

Author information

Authors and Affiliations

Centro de Investigación y de Educación Superior de Ensenada (CICESE), Km. 107 Carr. Tijuana-Eda, Ensenada, 22860, Mexico
Dora Alvarez & Hugo Hidalgo

Corresponding author

Correspondence to Hugo Hidalgo.

Additional information

Responsible editor: R. Bayardo.

About this article

Cite this article

Alvarez, D., Hidalgo, H. Document analysis and visualization with zero-inflated poisson. Data Min Knowl Disc 19, 1–23 (2009). https://doi.org/10.1007/s10618-009-0127-4

Received: 09 December 2007
Accepted: 26 January 2009
Published: 14 February 2009
Issue Date: August 2009
DOI: https://doi.org/10.1007/s10618-009-0127-4
