Low-dimensional tracking of association structures in categorical data

Statistics and Computing

Abstract

In modern applications, such as text mining and signal processing, large amounts of categorical data are produced at a high rate and are characterized by association structures that change over time. Multiple correspondence analysis (MCA) is a well-established dimension reduction method for exploring the associations within a set of categorical variables. A critical step of the MCA algorithm is a singular value decomposition (SVD) or an eigenvalue decomposition (EVD) of a suitably transformed matrix. The high computational and memory requirements of ordinary SVD and EVD make them impractical for massive or sequential data sets. Several enhanced SVD/EVD approaches have recently been introduced to overcome these issues. The aim of the present contribution is twofold: (1) to extend MCA to a split-apply-combine framework that leads to an exact and parallel MCA implementation; (2) to allow for incremental updates (and downdates) of existing MCA solutions, which lead to an approximate yet highly accurate solution. For this purpose, two incremental EVD and SVD approaches with desirable properties are revised and embedded in the context of MCA.
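
As a rough illustration of why a split-apply-combine MCA can be exact, the sketch below (Python with NumPy, not the authors' code) uses the fact that the Burt matrix Z^T Z of an indicator matrix Z is the sum of the cross-products of its row chunks. Each chunk can therefore be processed independently, and the principal inertias obtained from the combined Burt matrix match those from an SVD of the full indicator matrix. The function names, the integer coding of the categories, and the choice of working through the Burt matrix are assumptions made for this example; the paper's own formulation may differ.

```python
import numpy as np

def one_hot(codes, levels):
    """Indicator (dummy) coding of integer-coded categorical data.

    codes  : (n, q) integer array, column j takes values 0..levels[j]-1
    levels : number of categories of each of the q variables
    """
    n = codes.shape[0]
    offsets = np.concatenate(([0], np.cumsum(levels)[:-1]))
    z = np.zeros((n, int(np.sum(levels))))
    z[np.arange(n)[:, None], codes + offsets] = 1.0
    return z

def burt_from_chunks(chunks, levels):
    """Split-apply-combine: add up the per-chunk cross-products Z_k^T Z_k."""
    total = 0.0
    for chunk in chunks:  # each iteration could run on a separate worker
        z_k = one_hot(chunk, levels)
        total = total + z_k.T @ z_k
    return total

def inertias_from_burt(burt, n, q):
    """Principal inertias of the indicator matrix via an EVD of S^T S.

    Assumes every category is observed at least once (no zero masses).
    """
    c = np.diag(burt) / (n * q)                 # column masses
    m = burt / (n * q**2) - np.outer(c, c)      # centred cross-product
    s = m / np.sqrt(np.outer(c, c))             # D_c^{-1/2} M D_c^{-1/2}
    eigvals = np.linalg.eigvalsh(s)[::-1]       # descending order
    return np.clip(eigvals, 0.0, None)          # remove tiny negative round-off

def inertias_direct(z, q):
    """Reference: squared singular values of the standardized residuals."""
    n = z.shape[0]
    p = z / (n * q)
    r, c = p.sum(axis=1), p.sum(axis=0)
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(s, compute_uv=False) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    levels = [3, 4, 2]
    codes = np.column_stack([rng.integers(0, k, size=200) for k in levels])
    burt = burt_from_chunks(np.array_split(codes, 4), levels)
    chunked = inertias_from_burt(burt, n=200, q=len(levels))
    direct = inertias_direct(one_hot(codes, levels), q=len(levels))
    print(np.allclose(chunked, direct, atol=1e-8))
```

Only the small categories-by-categories Burt matrix has to be held in memory at the combine step, which is what makes this kind of scheme attractive for tall data sets. The incremental updates mentioned in the abstract are a different, approximate route and are not covered by this sketch.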

Acknowledgments

The authors are grateful to all the anonymous reviewers for their helpful comments and suggestions on an earlier version of this paper. Many thanks are due to Yoshio Takane, George Menexes and Hervé Abdi for their help and constructive feedback.

Author information

Corresponding author

Correspondence to Alfonso Iodice D’Enza.

About this article

Cite this article

Iodice D’Enza, A., Markos, A. Low-dimensional tracking of association structures in categorical data. Stat Comput 25, 1009–1022 (2015). https://doi.org/10.1007/s11222-014-9470-4

  • DOI: https://doi.org/10.1007/s11222-014-9470-4
