Efficient Streaming Detection of Hidden Clusters in Big Data Using Subspace Stream Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8505)

Abstract

Recently, many data mining techniques have been revisited to cope with new big data challenges. Nearly all of these algorithms focus on efficiency so that mining can keep pace with the growing size of the data. However, as the dimensionality of the data increases, not only the efficiency but also the effectiveness of traditional mining algorithms is compromised. For instance, clusters hidden in some subspaces become hard to detect with traditional clustering algorithms as the dimensionality grows. In this paper, we address both the huge size and the high dimensionality of big data with a novel three-phase model for subspace stream clustering algorithms. In its first phase, the model copes with the huge size of the data by continuously applying a streaming concept over the data objects and summarizing them into micro-clusters. Then, after each batch of data or upon a user request, the second phase is applied to the micro-cluster summaries to reconstruct the current distribution of the data. In the third phase, a subspace clustering algorithm is applied to overcome the high dimensionality of the data and to find clusters hidden within some subspaces. We perform an extensive evaluation study of different scenarios that follow our model over a big data set.
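
The three phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cluster-feature summary (count, linear sum, squared sum), the fixed absorption radius, the Gaussian resampling used for reconstruction, and the variance test standing in for a real subspace clustering algorithm are all assumptions made for illustration, and every name (MicroCluster, summarize, reconstruct, relevant_dimensions) is hypothetical.

```python
import math
import random
import statistics
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Summary of absorbed points: count, per-dimension linear and squared sums."""
    n: int
    ls: list
    ss: list

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), [x * x for x in p])

    def add(self, p):
        self.n += 1
        for i, x in enumerate(p):
            self.ls[i] += x
            self.ss[i] += x * x

    def center(self):
        return [s / self.n for s in self.ls]

    def std(self):
        # Per-dimension standard deviation, clamped at 0 against rounding error.
        return [math.sqrt(max(q / self.n - (s / self.n) ** 2, 0.0))
                for s, q in zip(self.ls, self.ss)]


def summarize(stream, radius):
    """Phase 1 (assumed rule): absorb each point into the nearest
    micro-cluster within `radius`, otherwise open a new micro-cluster."""
    mcs = []
    for p in stream:
        best = min(mcs, key=lambda m: math.dist(m.center(), p), default=None)
        if best is not None and math.dist(best.center(), p) <= radius:
            best.add(p)
        else:
            mcs.append(MicroCluster.from_point(p))
    return mcs


def reconstruct(mcs, rng):
    """Phase 2 (assumed rule): regenerate n points per micro-cluster by
    sampling an axis-aligned Gaussian around its center."""
    points = []
    for m in mcs:
        c, s = m.center(), m.std()
        for _ in range(m.n):
            points.append([rng.gauss(ci, si) for ci, si in zip(c, s)])
    return points


def relevant_dimensions(points, max_std):
    """Phase 3 stand-in: report dimensions in which the points are tightly
    concentrated. A real subspace clustering algorithm would go here."""
    columns = list(zip(*points))
    return [d for d, col in enumerate(columns)
            if statistics.pstdev(col) <= max_std]


# Toy stream: a cluster that is tight in dimensions 0 and 1
# and spread uniformly along dimension 2.
rng = random.Random(0)
stream = [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1), rng.uniform(0.0, 100.0)]
          for _ in range(500)]

mcs = summarize(stream, radius=10.0)        # phase 1: online summaries
points = reconstruct(mcs, rng)              # phase 2: offline reconstruction
dims = relevant_dimensions(points, 1.0)     # phase 3: subspace of the cluster
print(len(mcs), "micro-clusters; relevant dimensions:", dims)
```

In this sketch only the micro-clusters are kept in memory while the stream is read, and the later phases work on points regenerated from those summaries rather than on the raw data. That separation is what allows the second and third phases to run per batch or on request.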



Acknowledgments

This work has been supported by the UMIC Research Centre, RWTH Aachen University, Germany.

Author information

Correspondence to Marwan Hassani.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hassani, M., Seidl, T. (2014). Efficient Streaming Detection of Hidden Clusters in Big Data Using Subspace Stream Clustering. In: Han, WS., Lee, M., Muliantara, A., Sanjaya, N., Thalheim, B., Zhou, S. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43984-5_11

  • DOI: https://doi.org/10.1007/978-3-662-43984-5_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-43983-8

  • Online ISBN: 978-3-662-43984-5

  • eBook Packages: Computer Science, Computer Science (R0)
