Principal Component Analysis for Distributed Data Sets with Updating

  • Conference paper
Advanced Parallel Processing Technologies (APPT 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3756)

Abstract

Identifying patterns in large data sets is a key requirement in data mining. A powerful technique for this purpose is principal component analysis (PCA). PCA-based clustering algorithms are effective when the data sets reside at a single location. In applications where the large data sets are physically far apart, moving huge amounts of data to a single location can become an impractical, or even impossible, task. A way around this problem was proposed in [10], where truncated singular value decompositions (SVDs) are computed locally and used to reduce the communication costs. Unfortunately, truncated SVDs introduce local approximation errors that can accumulate and degrade the accuracy of the final PCA. In this paper, we introduce a new method to compute the PCA without incurring local approximation errors. In addition, we consider the situation of updating the PCA when new data arrive at the various locations.
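The idea of merging local results without approximation error can be illustrated with a small NumPy sketch. It is not taken from the paper and may differ from the authors' algorithm. It only shows one standard way for sites to exchange small summaries (row count, mean vector, and the triangular R factor from a QR decomposition of the centered local data) instead of raw data or truncated SVDs, so that the merged result matches the PCA of the pooled data up to rounding. The function names `local_summary`, `merge`, and `pca_from_summary` are hypothetical.

```python
import numpy as np

def local_summary(X):
    """Summarize one site's data block X (rows are observations)."""
    n = X.shape[0]
    mean = X.mean(axis=0)
    # R factor of the centered block: R.T @ R equals the local scatter matrix.
    R = np.linalg.qr(X - mean, mode="r")
    return n, mean, R

def merge(a, b):
    """Combine two summaries into the summary of the pooled data."""
    n1, m1, R1 = a
    n2, m2, R2 = b
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    # Extra row accounting for the difference between the two local means.
    shift = np.sqrt(n1 * n2 / n) * (m1 - m2)
    # Re-triangularize so the summary stays small (at most p x p).
    R = np.linalg.qr(np.vstack([R1, R2, shift]), mode="r")
    return n, mean, R

def pca_from_summary(summary):
    """Return the mean, principal directions (rows), and variances."""
    n, mean, R = summary
    _, s, Vt = np.linalg.svd(R, full_matrices=False)
    return mean, Vt, s**2 / (n - 1)
```

Because `merge` accepts any two summaries, a batch of newly arrived rows can be folded in the same way as another site: summarize the batch with `local_summary` and merge it into the existing summary. Under these assumptions each site transmits on the order of p² numbers for p variables, regardless of how many rows it holds.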

References

  1. Bai, Z., Demmel, J., Dongarra, J., Petitet, A., Robinson, H., Stanley, K.: The Spectral Decomposition of Nonsymmetric Matrices on Distributed Memory Parallel Computers. SIAM J. Sci. Comput. 18(5), 1446–1461 (1997)

  2. Boley, D.: Principal Direction Divisive Partitioning. Data Min. Knowl. Discov. 2(4), 325–344 (1998)

  3. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)

  4. Hotelling, H.: Analysis of a Complex of Statistical Variables into Principal Components. J. Educ. Psych. 24, 417–441, 498–520 (1933)

  5. Jackson, J.E.: User’s Guide to Principal Components. Wiley, New York (1991)

  6. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (1986)

  7. Kargupta, H., Huang, W.Y., Sivakumar, K., Johnson, E.: Distributed Clustering Using Collective Principal Component Analysis. Knowl. Inf. Syst. 3(4), 422–448 (2001)

  8. Lee, J.B., Woodyatt, A.S., Berman, M.: Enhancement of High Spectral Resolution Remote Sensing Data by a Noise-Adjusted Principal Component Transform. IEEE Trans. Geosci. Remote Sensing 28(3), 295–304 (1990)

  9. Pearson, K.: On Lines and Planes of Closest Fit to Systems of Points in Space. Phil. Mag. 2(6), 559–572 (1901)

  10. Qu, Y.M., Ostrouchov, G., Samatova, N., Geist, A.: Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets. In: Proceedings of the Second SIAM International Conference on Data Mining (April 2002)

  11. Rabani, E., Toledo, S.: Out-of-Core SVD and QR Decompositions. In: Proceedings of the 10th SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, Virginia (March 2001)

  12. Wegman, E.J.: Huge Data Sets and the Frontiers of Computational Feasibility. J. Comput. Graph. Statist. 4(4), 281–295 (1995)

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bai, ZJ., Chan, R.H., Luk, F.T. (2005). Principal Component Analysis for Distributed Data Sets with Updating. In: Cao, J., Nejdl, W., Xu, M. (eds) Advanced Parallel Processing Technologies. APPT 2005. Lecture Notes in Computer Science, vol 3756. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573937_51

  • DOI: https://doi.org/10.1007/11573937_51

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29639-3

  • Online ISBN: 978-3-540-32107-1

  • eBook Packages: Computer Science, Computer Science (R0)
