Abstract
This article presents a unified theory for the analysis of components in discrete data, and compares the resulting methods with techniques such as independent component analysis, non-negative matrix factorisation and latent Dirichlet allocation. The main families of algorithms discussed are a variational approximation, Gibbs sampling, and Rao-Blackwellised Gibbs sampling. Applications are presented for voting records from the United States Senate for 2003, and for the Reuters-21578 newswire collection.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Buntine, W., Jakulin, A. (2006). Discrete Component Analysis. In: Saunders, C., Grobelnik, M., Gunn, S., Shawe-Taylor, J. (eds) Subspace, Latent Structure and Feature Selection. SLSFS 2005. Lecture Notes in Computer Science, vol 3940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11752790_1
DOI: https://doi.org/10.1007/11752790_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34137-6
Online ISBN: 978-3-540-34138-3
eBook Packages: Computer Science (R0)