AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Abstract

For discovering hidden (latent) variables in real-world, non-Gaussian data streams or an n-dimensional cloud of data points, SVD suffers from its orthogonality constraint. Our proposed method, “AutoSplit”, finds features that are mutually independent and is able to discover non-orthogonal features. Thus, it (a) finds more meaningful hidden variables and features, (b) can easily lead to clustering and segmentation, (c) surprisingly, scales linearly with the database size, and (d) can also operate in on-line, single-pass mode. We also propose “Clustering-AutoSplit”, which extends the feature discovery to multiple sets of features/bases and leads to clean clustering. Experiments on multiple real-world data sets show that our method has all of the properties above, outperforming the state-of-the-art SVD.
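The contrast the abstract draws between orthogonal SVD directions and mutually independent, non-orthogonal features can be illustrated with a small sketch. This is not the AutoSplit algorithm, which this page does not describe; it is a textbook two-dimensional independent component analysis (whiten the data, then search for the rotation that maximizes squared excess kurtosis) used as a stand-in. The synthetic data, the 45-degree basis vectors, and all function names (`mixed_points`, `pca_2d`, `ica_2d`, `angle_deg`) are assumptions made for illustration.

```python
import math
import random

def mixed_points(n=4000, seed=0):
    """Mix two independent uniform (non-Gaussian) sources along non-orthogonal bases."""
    rng = random.Random(seed)
    # Hypothetical ground-truth bases, 45 degrees apart.
    bases = [(1.0, 0.0), (math.cos(math.pi / 4), math.sin(math.pi / 4))]
    pts = []
    for _ in range(n):
        s = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        pts.append(tuple(s[0] * bases[0][k] + s[1] * bases[1][k] for k in range(2)))
    return pts, bases

def pca_2d(pts):
    """Center the points; return them with the principal axes and their variances."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    centered = [(p[0] - mx, p[1] - my) for p in pts]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    # Closed-form eigendecomposition of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    mid, half = (cxx + cyy) / 2, math.hypot((cxx - cyy) / 2, cxy)
    axes = [(math.cos(theta), math.sin(theta)), (-math.sin(theta), math.cos(theta))]
    return centered, axes, [mid + half, mid - half]

def ica_2d(pts, steps=90):
    """Estimate two independent (possibly non-orthogonal) basis directions."""
    centered, axes, variances = pca_2d(pts)
    sd = [math.sqrt(v) for v in variances]
    # Whitening: project onto the principal axes and rescale to unit variance.
    white = [tuple((x * a[0] + y * a[1]) / d for a, d in zip(axes, sd))
             for x, y in centered]

    def score(phi):
        # Sum of squared excess kurtosis of the two rotated components.
        c, s = math.cos(phi), math.sin(phi)
        k1 = sum((c * u + s * v) ** 4 for u, v in white) / len(white) - 3
        k2 = sum((c * v - s * u) ** 4 for u, v in white) / len(white) - 3
        return k1 * k1 + k2 * k2

    # Grid search over rotations; the score repeats every 90 degrees.
    phi = max((i * math.pi / 2 / steps for i in range(steps)), key=score)
    c, s = math.cos(phi), math.sin(phi)
    # Undo the whitening to express each component as a direction in data space.
    return [tuple(sd[0] * r[0] * axes[0][k] + sd[1] * r[1] * axes[1][k] for k in range(2))
            for r in [(c, s), (-s, c)]]

def angle_deg(u, v):
    """Angle between two directions in degrees, ignoring sign."""
    dot = abs(u[0] * v[0] + u[1] * v[1])
    return math.degrees(math.acos(min(1.0, dot / (math.hypot(*u) * math.hypot(*v)))))

pts, true_bases = mixed_points()
_, pca_axes, _ = pca_2d(pts)
ica_bases = ica_2d(pts)
print("angle between PCA axes:", angle_deg(pca_axes[0], pca_axes[1]))
print("angle between ICA bases:", angle_deg(ica_bases[0], ica_bases[1]))
```

The principal axes are orthogonal by construction, so they cannot both align with bases that are 45 degrees apart, whereas the independence criterion is free to return directions at any angle. The grid search is chosen for readability only and does not reflect the linear-time or single-pass properties claimed in the abstract.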

Supported in part by Japan-U.S. Cooperative Science Program of JSPS; grants from JSPS and MEXT (#15017207, #15300027); the NSF No. IRI-9817496, IIS-9988876, IIS-0113089, IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549, EF-0331657; the Pennsylvania Infrastructure Technology Alliance No. 22-901-0001; DARPA No. N66001-00-1-8936; and donations from Intel and Northrop-Grumman.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pan, JY., Kitagawa, H., Faloutsos, C., Hamamoto, M. (2004). AutoSplit: Fast and Scalable Discovery of Hidden Variables in Stream and Multimedia Databases. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_62

  • DOI: https://doi.org/10.1007/978-3-540-24775-3_62

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22064-0

  • Online ISBN: 978-3-540-24775-3

  • eBook Packages: Springer Book Archive
