Abstract
In modern applications, such as text mining and signal processing, large amounts of categorical data are produced at a high rate, and their association structures change over time. Multiple correspondence analysis (MCA) is a well-established dimension reduction method for exploring the associations within a set of categorical variables. A critical step of the MCA algorithm is a singular value decomposition (SVD) or an eigenvalue decomposition (EVD) of a suitably transformed matrix. The high computational and memory requirements of ordinary SVD and EVD make them impractical for massive or sequential data sets. Several enhanced SVD/EVD approaches have recently been introduced in an effort to overcome these issues. The aim of the present contribution is twofold: (1) to extend MCA to a split-apply-combine framework, which leads to an exact and parallel MCA implementation; (2) to allow for incremental updates (downdates) of existing MCA solutions, which lead to an approximate yet highly accurate solution. For this purpose, two incremental EVD and SVD approaches with desirable properties are revised and embedded in the context of MCA.
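To make the role of the SVD/EVD step concrete, the following minimal Python sketch computes MCA principal inertias in two ways: from an SVD of the standardized residuals of the full indicator matrix, and from an EVD of a matrix assembled chunk by chunk. It is an illustration under stated assumptions, not the algorithm proposed in the paper. The chunked route relies on the fact that the Burt matrix (the cross-product of the indicator matrix with itself) is a sum over row blocks. The function names `indicator_matrix`, `mca_inertias_full` and `mca_inertias_from_chunks` are hypothetical, and every category is assumed to be observed at least once.

```python
import numpy as np


def indicator_matrix(data, n_categories):
    """One-hot encode an (n, Q) integer-coded array into an (n, J) indicator matrix."""
    blocks = [np.eye(k)[data[:, q]] for q, k in enumerate(n_categories)]
    return np.hstack(blocks)


def mca_inertias_full(Z, Q):
    """Principal inertias from an SVD of the standardized residuals of Z."""
    n = Z.shape[0]
    P = Z / (n * Q)                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (all equal to 1/n)
    c = P.sum(axis=0)                    # column masses (assumed > 0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sv = np.linalg.svd(S, compute_uv=False)
    return sv ** 2


def mca_inertias_from_chunks(chunks, Q):
    """Same inertias from row chunks: accumulate the Burt matrix, then one EVD."""
    burt = sum(Zk.T @ Zk for Zk in chunks)   # "apply" per chunk, "combine" by summing
    n = sum(Zk.shape[0] for Zk in chunks)
    c = np.diag(burt) / (n * Q)              # column masses from the Burt diagonal
    M = (burt / (n * Q ** 2) - np.outer(c, c)) / np.sqrt(np.outer(c, c))
    return np.linalg.eigvalsh(M)[::-1]       # descending order


# Illustrative data: 200 observations on Q = 3 variables with 3, 4 and 2 categories.
rng = np.random.default_rng(0)
n_categories = [3, 4, 2]
data = np.column_stack([rng.integers(0, k, size=200) for k in n_categories])
# Force every category to appear at least once (assumption of this sketch).
data[:4] = np.column_stack([np.arange(4) % k for k in n_categories])

Z = indicator_matrix(data, n_categories)
full = mca_inertias_full(Z, Q=3)
split = mca_inertias_from_chunks(np.array_split(Z, 4), Q=3)
print(np.allclose(full, split, atol=1e-10))
```

Because only the J × J Burt matrix is accumulated, each chunk can be processed independently and the memory cost does not grow with the number of rows. The final comparison checks that the two routes agree up to floating-point error.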
Acknowledgments
The authors are grateful to all the anonymous reviewers for their helpful comments and suggestions on an earlier version of this paper. Many thanks are due to Yoshio Takane, George Menexes and Hervé Abdi for their help and constructive feedback.
Cite this article
Iodice D’Enza, A., Markos, A. Low-dimensional tracking of association structures in categorical data. Stat Comput 25, 1009–1022 (2015). https://doi.org/10.1007/s11222-014-9470-4
DOI: https://doi.org/10.1007/s11222-014-9470-4