An Adaptive Clustering Approach for Distributed Outlier Detection in Data Streams

  • Conference paper
Distributed Computing and Artificial Intelligence, 19th International Conference (DCAI 2022)

Abstract

Many real-world problems involve collections of high-dimensional data, i.e., data with many different features. A dataset with a large number of features incurs the so-called curse of dimensionality: as the dimensionality increases, the volume of the space grows rapidly and the data become sparse. This makes clustering high-dimensional data for outlier detection challenging. In this paper, we design and implement a distributed peer-to-peer version of an algorithm that addresses the curse of dimensionality by generating candidate subspaces from the high-dimensional space through Principal Component Analysis. The experimental results show that, if the parameters of the distributed algorithm are properly set, the distributed algorithm converges to the results provided by the sequential algorithm, which is a fundamental and highly desirable property.
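
The convergence property stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes equal-sized data partitions, a simple pairwise gossip averaging protocol, and hypothetical names (`local_state`, `gossip_average`, `top_subspace`) and parameter values chosen only for illustration. Each peer summarizes its data by a mean vector and a second-moment matrix, the peers average these summaries by gossip, and each peer then derives a principal subspace that can be compared with the one computed sequentially on the whole dataset.

```python
import numpy as np

def local_state(X):
    """Flattened per-peer summary: mean vector followed by second-moment matrix."""
    mu = X.mean(axis=0)
    second = X.T @ X / len(X)
    return np.concatenate([mu, second.ravel()])

def gossip_average(states, rounds, rng):
    """Pairwise averaging: in every round each peer averages its state
    with one randomly chosen other peer (assumed protocol)."""
    states = [s.copy() for s in states]
    n = len(states)
    for _ in range(rounds):
        for i in range(n):
            j = int(rng.integers(n - 1))
            if j >= i:          # skip index i so a peer never pairs with itself
                j += 1
            avg = 0.5 * (states[i] + states[j])
            states[i], states[j] = avg, avg.copy()
    return states

def covariance_from_state(state, d):
    """Population covariance E[xx^T] - mu mu^T recovered from a summary."""
    mu, second = state[:d], state[d:].reshape(d, d)
    return second - np.outer(mu, mu)

def top_subspace(cov, k):
    """Projection matrix onto the k leading principal directions.
    Using the projector avoids the sign ambiguity of eigenvectors."""
    _, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    V = vecs[:, -k:]
    return V @ V.T

# Synthetic data (illustrative values): 8 peers, 200 points each, 6 features.
rng = np.random.default_rng(0)
d, k, n_peers, per_peer = 6, 2, 8, 200
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.1])
X = rng.normal(size=(n_peers * per_peer, d)) * scales
parts = np.split(X, n_peers)

# Distributed estimate: every peer ends up with (almost) the same summary.
states = gossip_average([local_state(P) for P in parts], rounds=60, rng=rng)

# Sequential reference computed on the full dataset.
cov_central = np.cov(X, rowvar=False, bias=True)
proj_central = top_subspace(cov_central, k)

max_cov_err = max(
    np.abs(covariance_from_state(s, d) - cov_central).max() for s in states
)
max_proj_err = max(
    np.abs(top_subspace(covariance_from_state(s, d), k) - proj_central).max()
    for s in states
)
print("max covariance error across peers:", max_cov_err)
print("max subspace projector error across peers:", max_proj_err)
```

In this sketch the number of gossip rounds plays the role of a parameter that must be "properly set": with too few rounds the peers' summaries still differ from one another and from the sequential result.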

Corresponding author

Correspondence to Andrea Della Monaca.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Monaca, A.D., Cafaro, M., Pulimeno, M., Epicoco, I. (2023). An Adaptive Clustering Approach for Distributed Outlier Detection in Data Streams. In: Omatu, S., Mehmood, R., Sitek, P., Cicerone, S., Rodríguez, S. (eds) Distributed Computing and Artificial Intelligence, 19th International Conference. DCAI 2022. Lecture Notes in Networks and Systems, vol 583. Springer, Cham. https://doi.org/10.1007/978-3-031-20859-1_10
